In this paper, the fully compressed pattern matching problem is studied. The compression is
represented by straight-line programs (SLPs), i.e. context-free grammars generating exactly one string;
the term fully means that both the pattern and the text are given in the compressed form. The
problem is approached using a recently developed technique of local recompression: the SLPs are
refactored, so that substrings of the pattern and text are encoded in both SLPs in the same way. To
this end, the SLPs are locally decompressed and then recompressed in a uniform way.

This technique yields an $O((n + m) \log M \log(n + m))$ algorithm for compressed pattern matching,
where $n$ ($m$) is the size of the compressed representation of the text (pattern, respectively), while $M$
is the size of the decompressed pattern. Since $M \le 2^m$, this substantially improves the previously
best $O(m^2 n)$ algorithm.

Since the LZ compression standard reduces to SLP with $\log(N/n)$ overhead and in $O(n \log(N/n))$
time, the presented algorithm can be applied also to the fully LZ-compressed pattern matching
problem, yielding an $O(s \log s \log M)$ running time, where $s = n \log(N/n) + m \log(M/m)$.

1 Introduction

Compression and Straight-Line Programs



Due to the ever-increasing amount of data, compression methods are widely applied in order to decrease the
data's size. Still, the stored data is accessed and processed. Decompressing it on each such occasion
basically wastes the gain of reduced storage size, especially since we do not even know in advance which
data is relevant to our queries, and so we decompress many completely irrelevant files. Thus there is a large
demand for algorithms dealing directly with the compressed data, without explicit decompression.
The commonly investigated problem is compressed pattern matching, i.e. pattern matching in which
the text is supplied in a compressed form.

Processing compressed data is not as hopeless as it may seem: it is a popular outlook that compression
basically extracts the hidden structure of the text, and if the compression rate is high, the data has
a lot of internal structure. It is natural to assume that such structure will help in devising methods
dealing directly with the compressed representation. Indeed, efficient algorithms for fundamental text
operations (pattern matching, equality testing, etc.) are known for various practically used compression
methods (LZ, LZW, their variants, etc.).

The compression standards differ in the main idea as well as in details. Thus when devising algorithms
for compressed data, quite early one needs to focus on the exact compression method to which the
algorithm is applied. The most practical (and challenging) choice is one of the widely used standards,
like LZW or LZ. However, a different approach is also pursued: for some applications (and most
theory-oriented considerations) it would be useful to model one of the practical compression standards
by a more mathematically well-founded method. This idea, among others, lies at the foundations of the
notion of Straight-Line Programs (SLPs), which are simply context-free grammars generating exactly
one string.



'Supported by NCN grant number DEC-2011/01/D/ST6/07164, 2011-2014 and by a scholarship funded by the Human 
Capital Programme. 
SLPs are the most popular theoretical model of compression. On the one hand, this is motivated by a
simple, 'clean' and appealing definition; on the other hand, they model the LZ compression standard:
each LZ-compressed text can be converted into an equivalent SLP with only $\log(N/n)$ overhead and in
$O(n \log(N/n))$ time (where $N$ is the size of the decompressed text) [H], while each SLP can be converted
to an equivalent LZ with just a constant overhead (and in linear time).

The approach of modelling LZ by SLPs in order to develop efficient algorithms turned out to be
fruitful: the recent state-of-the-art (and in some sense optimal) algorithm for pattern matching in
LZ-compressed texts changes the LZ-compression into the SLP-one as its first step [I]. To the author's best
knowledge, there are no algorithms for FCPM specific to LZ; instead, the translation to SLP is used in
such a case. On the other hand, algorithmic problems for SLP-compressed input strings were considered
and successfully solved.

Problem statement 

The problem considered in this paper is the fully compressed pattern matching problem (FCPM), i.e. we are
given a text of length $N$ and a pattern of length $M$, represented by SLPs of size $n$ and $m$, respectively.
We are to answer whether the pattern appears in the text and give a compact representation of all such
appearances in the text.

Previous and related results 

The first algorithmic result dealing with SLPs is for compressed equality testing, i.e. the question
whether two SLPs represent the same text. This was solved by Plandowski in 1994 [18], with $O(n^4)$
running time. The first solution for FCPM, by Karpiński et al., followed a year later [12]; its main
drawback was that the proposed algorithm did not return the positions of all pattern appearances in the
text. Next, a polynomial algorithm for computing various combinatorial properties of SLP-generated
texts, in particular pattern matching, was given by Gąsieniec et al. [7]; the same authors also presented
a faster randomised algorithm for FCPM [5]; both these algorithms returned a compact representation of
all pattern appearances. In 1997 Miyazaki et al. [17] constructed a new $O(n^2 m^2)$ algorithm for FCPM. A
faster $O(mn)$ algorithm for a special sub-case (restricting the form of SLPs) was given in 2000 by Hirao
et al. [5]. Finally, in 2007, a state-of-the-art $O(nm^2)$ algorithm was given by Lifshits [15].

Concerning related problems, pattern matching in which the text is compressed using LZW method 
and the pattern is supplied uncompressed was proposed and recently a linear-time algorithm was given [3] . 
A variant in which the pattern is also compressed using LZW was also considered and a linear-time 
algorithm was recently developed [B]. Pattern matching for multiple patterns in LZW compressed text 
was also studied [5]. 

Similar work was also carried out for LZ-compressed text, for which the problem becomes substantially
harder than in the LZW case. In 2011, an $O(n \log(N/n) + m)$ algorithm, which is in some sense optimal, was
proposed [I].

The paradigm employed in all the mentioned work was to consider the combinatorial properties of
strings described by the appropriate compression methods; our method uses a new paradigm.

Our results and techniques 

We give an $O((n + m) \log M \log(n + m))$ algorithm for FCPM, i.e. the pattern matching problem in which
both the text and the pattern are supplied as SLPs. This outperforms the previously best $O(m^2 n)$
algorithm [15].

Theorem 1. Algorithm FCPM returns an $O((n + m) \log(n + m))$ representation of all pattern appearances,
where $n$ ($m$) is the size of the SLP-compressed text (pattern, respectively) and $M$ is the size of the
decompressed pattern. It runs in $O((n + m) \log M \log(n + m))$ time. The space consumption is
$O((n + m) \log(n + m))$.

This representation allows calculation of the number of pattern appearances and, if $N$ fits in $O(1)$
codewords, also the position of the first, last, etc. pattern appearance; otherwise the space consumption
increases to $O((n + m) \log(N + M) \log(n + m))$.

Our approach to the problem is essentially different from all approaches previously applied to compressed
pattern matching. We do not consider any combinatorial properties of the encoded strings. Instead, we analyse
and change the way strings are described by the SLPs in the instance. That is, we focus on the SLPs
alone, ignoring any properties of the encoded strings. Roughly speaking, our algorithm aims at having all
the strings in the instance compressed 'in the same way'. To achieve this goal, we decompress the SLPs.
Since the compressed text can be exponentially long, we do this locally: we introduce explicit strings into
the right-hand sides of the productions. Then, we recompress these explicit strings uniformly: roughly, a
fixed pair of letters $ab$ is replaced by a new letter $c$ in both the text and the pattern; such a procedure is
applied for every possible pair of letters. Since such pieces of text are compressed in the same way, we can
'forget' about the original substrings of the input and treat the introduced nonterminals as atomic letters.
Such recompression shortens the pattern (and the text) significantly: roughly, one 'round' of recompression,
in which every pair of letters that was present at the beginning of the 'round' is compressed, shortens the
encoded strings by a constant factor. Thus, there are $O(\log M)$ rounds.
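
The uniform replacement of a fixed pair can be illustrated on plain, uncompressed strings. The following is only a minimal sketch under simplifying assumptions: words are Python lists of symbols, and the function name `compress_pair` and the fresh letter name `"c1"` are hypothetical, not taken from the paper. The algorithm itself works on the grammar, never on the decompressed strings.

```python
def compress_pair(word, a, b, c):
    """Replace every occurrence of the pair (a, b), with a != b, by the letter c.

    Since a != b, two occurrences of ab never overlap, so a single
    left-to-right scan is unambiguous.
    """
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(c)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


# The same pair is replaced by the same fresh letter in text and pattern.
text = compress_pair(list("abcabab"), "a", "b", "c1")
pattern = compress_pair(list("abab"), "a", "b", "c1")
```

In this toy instance the pattern appearance survives the compression, because the appearance starts and ends on pair boundaries; the treatment of the pattern's ends in the general case is discussed later in the paper.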

Although it is not so hard to believe that this high level idea can work, it is much less believable that 
this can be turned into a fast, efficient and simple algorithm. However, by choosing wisely the parts of 
the text to be recompressed and keeping the overall size of the instance low, we manage to achieve the 
goal. 

Similar techniques While the application of the idea of recompression to pattern matching is new, related
approaches were previously employed: most notably, the idea of replacing short strings by a fresh letter
and iterating this procedure was used by Mehlhorn et al. [16] in their work on a data structure for
equality testing for dynamic strings (cf. also an improved implementation of a similar data structure by
Brodal et al.). They viewed this process as 'hashing'. In particular, their method can be
straightforwardly applied to equality testing for SLPs, yielding a nearly quadratic algorithm (as observed by
Gawrychowski [5]). However, the technical details of the construction make an extension to FCPM
problematic: while this method can be used to build 'canonical' SLPs for the text and the pattern, there
is no apparent way to control what these SLPs actually look like and how they encode the strings.

In the area of compressed membership problems, from which the presented method emerged, recent
work of Mathissen and Lohrey [TS] already implemented the idea of replacing strings with fresh letters
as well as modifications of the instance so that such replacement is possible. However, the replacement
was not iterated, and the newly introduced letters could not be further compressed.

Other applications of the technique A cruder variant of the recompression technique has been used
in order to solve an old open problem regarding the fully compressed membership problem for NFAs.
Furthermore, a variant of this method can also be applied in the area of word equations. While not
claiming any essentially new results, the recompression approach yielded much simpler proofs and faster
algorithms for many classical results in the area, like a PSPACE algorithm for solving word equations, a double
exponential bound on the size of the solution, an exponential bound on the exponent of periodicity, etc. [11].

Computational model and positions in text 

Our algorithm uses RadixSort and we assume that the codeword is of size $O(\log(n + m))$. However, we
do not make such assumptions on $N$ and $M$. Changing the model to a pointer machine introduces a
$\log(n + m)$ factor to the running time.

The position of the first appearance of the pattern in the text might be exponential in $n$, and so it is
infeasible to output it within the given bounds. However, if we assume that $N$ fits in a constant number
of codewords, our algorithm can also output the position of the first, last, etc. appearance of the pattern.

2 Basic notions, outline of the algorithm 

Straight-line programs

Formally, a Straight-Line Program (SLP) is a context-free grammar $G$ over the alphabet $\Sigma$ with a set
of nonterminals $X$, generating a one-word language. For normalisation reasons, it is assumed that $G$ is
in Chomsky normal form, i.e. each production is either of the form $X \to YZ$ or $X \to a$. We denote
the string defined by a nonterminal $X_i$ by $\mathrm{val}(X_i)$, like value; this notion extends to $\mathrm{val}(\alpha)$ for $\alpha \in (X \cup \Sigma)^*$
in the usual way. We also use $\mathrm{first}[X_i]$ ($\mathrm{last}[X_i]$) to denote the first (last, respectively) letter of $\mathrm{val}(X_i)$.
The tables $\mathrm{first}[]$ and $\mathrm{last}[]$ are stored by the algorithm FCPM.
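
As an illustration of these notions, the sketch below computes the lengths and the first and last tables of an SLP in Chomsky normal form bottom-up, without decompressing it. The representation is an assumption made for this example only: nonterminals are integers, a rule is either a single letter (a Python string) or a pair of smaller-numbered nonterminals, and the names `slp_tables` and `val` are hypothetical.

```python
def slp_tables(rules):
    """Compute |val(X)|, first[X] and last[X] for every nonterminal X.

    rules[X] is a letter (str) or a pair (Y, Z) with Y, Z < X, so that
    processing the nonterminals in increasing order is bottom-up.
    """
    length, first, last = {}, {}, {}
    for x in sorted(rules):
        rhs = rules[x]
        if isinstance(rhs, str):
            length[x], first[x], last[x] = 1, rhs, rhs
        else:
            y, z = rhs
            length[x] = length[y] + length[z]
            first[x] = first[y]
            last[x] = last[z]
    return length, first, last


def val(rules, x):
    """Explicit decompression; exponential in general, for tiny examples only."""
    rhs = rules[x]
    if isinstance(rhs, str):
        return rhs
    return val(rules, rhs[0]) + val(rules, rhs[1])


# X1 -> a, X2 -> b, X3 -> X1 X2, X4 -> X3 X3, X5 -> X4 X4
rules = {1: "a", 2: "b", 3: (1, 2), 4: (3, 3), 5: (4, 4)}
length, first, last = slp_tables(rules)
```

Each additional rule of the form $X_{i+1} \to X_i X_i$ doubles the length of the generated string, which is why the tables are computed from the rules rather than from the decompressed text.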

Without loss of generality we may assume that $\Sigma$ consists of consecutive natural numbers (starting
from 1): it is enough to sort the input letters and number them $1, \ldots, |\Sigma|$. During our algorithm, the
alphabet $\Sigma$ is increased many times, and whenever this happens, the new letter is assigned number $|\Sigma| + 1$
(and the size of $\Sigma$ increases by 1 as well). $|\Sigma|$ does not become large in this way: it remains of size
$O((n + m) \log(n + m) \log M)$, see Lemma 8.

For our purposes it is more convenient to treat the two SLPs as a single context-free grammar $G$ with
a set of nonterminals $X = \{X_1, \ldots, X_{n+m}\}$, the text being given by $X_{n+m}$ and the pattern by $X_m$. We
assume, however, that $X_m$ is not referenced by any other nonterminal. Furthermore, in our constructions
it is essential to relax the usual assumption that $G$ is in Chomsky normal form; instead, we only require
that $G$ satisfies conditions (1a)-(1c).

We refer to these conditions collectively as (1). Let $X_i \to \alpha_i$; then a substring $u \in \Sigma^+$ of $\alpha_i$ appears
explicitly in the rule; this notion is introduced to distinguish such substrings from the substrings of $\mathrm{val}(X_i)$. Note
that (1) does not exclude the case when $X_i \to \epsilon$, and allowing such a possibility streamlines the analysis.

The size $|G|$ is the sum of the lengths of the right-hand sides of $G$'s rules. The size of $G$ kept by the
algorithm will be small: $O((n + m) \log(n + m))$, see Lemma 8.

There may be exponentially many appearances of the pattern in the text (for instance, when both the text
and the pattern are powers of a single letter $a$ and the text is exponentially long), and so naively
outputting all of them is infeasible. Instead, our algorithm provides an
$O((n + m) \log(n + m))$ SLP, in which appearances of a designated letter correspond to the pattern
appearances in the original instance.

(Non) crossing appearances 

The main part of the presented FCPM consists of recompression, i.e. replacing strings appearing in
$\mathrm{val}(X_m)$ by shorter ones throughout $\mathrm{val}(X_1), \ldots, \mathrm{val}(X_{n+m})$. In some cases such replacing is harder,
in others easier. It is intuitively clear that this depends on the position of the pair with regard to the
nonterminals: suppose that we are to compress a pair $ab$. If $b$ is the first letter of some $\mathrm{val}(X_i)$ and $aX_i$
appears explicitly in the grammar, then the compression seems hard, as it requires modification of $G$.
On the other hand, if neither this situation nor the symmetric one occurs, then replacing all explicit
appearances of $ab$ in $G$ should do the job. Thus, before stating the algorithm, we introduce a classification
of pairs into 'easy' and 'hard'.

We first formalise the notion that a nonterminal generates some substring of $\mathrm{val}(X_i)$. We say that $X_i$
generates $\mathrm{val}(X_i)$ starting at position 1; furthermore, if $X_j$ generates $\mathrm{val}(X_i)$ starting at position $p$ and
$X_j \to \alpha X_k \alpha'$, then $X_k$ generates $\mathrm{val}(X_i)$ starting at position $p + |\mathrm{val}(\alpha)|$. Symmetrically, $X_j$ generates
$\mathrm{val}(X_i)$ ending at some position. We use these notions only to say that $X_i$ generates the pattern ($\mathrm{val}(X_m)$) or
the text ($\mathrm{val}(X_{n+m})$) at some position.

We say that a letter $a \in \Sigma$ is to the left of $X_i$ if, for some position $p$, $a$ is the $p$-th letter of $\mathrm{val}(X_{n+m})$
(or $\mathrm{val}(X_m)$) and $X_i$ generates the text (or pattern, respectively) from position $p + 1$; in such a case we
say that $X_i$ is to the right of $a$. In the symmetric situation, we say that $a$ is to the right of $X_i$.

A pair of letters $ab$ is a crossing pair if there is a nonterminal $X_i$ such that $a$ is to the left of $X_i$ and
$\mathrm{first}[X_i] = b$ or, symmetrically, $b$ is to the right of $X_i$ and $\mathrm{last}[X_i] = a$; otherwise $ab$ is non-crossing.
Intuitively, $ab$ 'crosses' the symbols in some production $X_i \to u X_j v X_k w$. Unless explicitly written otherwise,
we apply this notion only to pairs of different letters.

The notion of (non-)crossing pairs is usually not applied to pairs of the form $aa$; instead, for a
letter $a \in \Sigma$ we say that $a^\ell$ is $a$'s maximal block of length $\ell$ if there exist two letters $x, y \in \Sigma$, where
$x \neq a \neq y$, such that $x a^\ell y$ is a substring of $\mathrm{val}(X_{n+m})$ (or $\mathrm{val}(X_m)$). We say that a letter $a \in \Sigma$ has a
crossing block if the pair $aa$ is crossing. The crossing pairs and letters with crossing blocks are intuitively
hard to compress.

The definition of the crossing pairs (and letters with crossing blocks) is very 'global' in the sense that
it uses $\mathrm{val}(X_{n+m})$ and $\mathrm{val}(X_m)$. However, it turns out that the set of crossing (non-crossing) pairs and of letters
with (without) crossing blocks can be easily established by reading $G$. The number of such pairs (blocks)
is linear.

Lemma 1. There are at most $2(n + m)$ different letters with crossing blocks, at most $4(n + m)$ different
crossing pairs and at most $|G|$ non-crossing pairs. For a letter $a$ there are at most $|G| + 4(n + m)$ different
lengths of $a$'s maximal blocks in $\mathrm{val}(X_1), \ldots, \mathrm{val}(X_{n+m})$.

The set of crossing (non-crossing) pairs can be calculated in $O(|G|)$ time.
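
One way to read the crossing pairs off the grammar is to inspect, in every rule, each pair of adjacent symbols of which at least one is a nonterminal. The sketch below is an illustration under stated assumptions and not the paper's implementation: nonterminals are integers, letters are strings, every rule has a non-empty right-hand side with non-empty value, rules refer only to smaller-numbered nonterminals, and every nonterminal is reachable from the text or the pattern. The names `first_last` and `crossing_pairs` are hypothetical, and a Python set is used instead of the RadixSort-based grouping.

```python
def first_last(rules):
    """first[X] and last[X] for each nonterminal, computed bottom-up."""
    first, last = {}, {}
    for i in sorted(rules):
        head, tail = rules[i][0], rules[i][-1]
        first[i] = head if isinstance(head, str) else first[head]
        last[i] = tail if isinstance(tail, str) else last[tail]
    return first, last


def crossing_pairs(rules):
    """Pairs ab (a != b) with a nonterminal on at least one side of the boundary."""
    first, last = first_last(rules)
    pairs = set()
    for rhs in rules.values():
        for left, right in zip(rhs, rhs[1:]):
            if isinstance(left, str) and isinstance(right, str):
                continue  # an explicit pair does not cross any nonterminal
            a = left if isinstance(left, str) else last[left]
            b = right if isinstance(right, str) else first[right]
            if a != b:  # pairs aa are handled as crossing blocks instead
                pairs.add((a, b))
    return pairs


# X1 -> ab, X2 -> X1 c X1, X3 -> b X2, so val(X3) = babcab
rules = {1: ["a", "b"], 2: [1, "c", 1], 3: ["b", 2]}
found = crossing_pairs(rules)
```

In this example the pair $ab$ occurs only explicitly inside the rule for $X_1$, so it is non-crossing, while $bc$, $ca$ and $ba$ are crossing.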



(1a) each $X_i$ has exactly one production, which has at most 2 nonterminals,
(1b) if $X_j$ appears in the rule for $X_i$ then $j < i$,
(1c) if $\mathrm{val}(X_i) = \epsilon$ then $X_i$ is not on the right-hand side of any production.
Outline of the algorithm

The main operations of our algorithm are two types of compression performed on strings encoded by $G$:

pair compression of $ab$ For two different letters $a$, $b$ such that $ab$ appears in $\mathrm{val}(X_m)$, replace each $ab$ in
$\mathrm{val}(X_1), \ldots, \mathrm{val}(X_{n+m})$ by a fresh letter $c$.

$a$'s block compression For each maximal block $a^\ell$, with $\ell > 1$, that appears in $\mathrm{val}(X_m)$, replace all
maximal blocks $a^\ell$ in $\mathrm{val}(X_1), \ldots, \mathrm{val}(X_{n+m})$ by a fresh letter $a_\ell$.
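
Block compression can likewise be illustrated on an explicit string. This is a simplified sketch, not the paper's procedure: it works on a decompressed word, it compresses every maximal block of $a$ of length greater than 1 (without checking that the block appears in the pattern), and the fresh letter $a_\ell$ is modelled by the hypothetical tuple `(a, l)`.

```python
from itertools import groupby


def compress_blocks(word, a):
    """Replace each maximal block a^l with l > 1 by one symbol (a, l)."""
    out = []
    for letter, run in groupby(word):
        length = len(list(run))
        if letter == a and length > 1:
            out.append((a, length))  # stands for the fresh letter a_l
        else:
            out.extend([letter] * length)  # other letters are left untouched
    return out


compressed = compress_blocks(list("aabaaab"), "a")
```

Blocks of different lengths receive different fresh letters, so $a_2$ and $a_3$ in the example are unrelated symbols after the compression.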

We adopt the following notational convention throughout the rest of the paper: whenever we refer to a
letter $a_\ell$, it means that the last block compression was done for $a$ and $a_\ell$ is the letter that replaced $a^\ell$.

We call the $\ell$th iteration of the main loop, i.e. the one in line 1, the $\ell$th phase. Ideally, each phase of
FCPM compresses each two consecutive letters into one letter; this gives $\log M$ iterations of this loop. This
is true, up to a constant factor.

Lemma 2. There are $O(\log M)$ executions of the main loop of FCPM.



Algorithm 1 FCPM: outline
1: while $|\mathrm{val}(X_m)| > 1$ do
2:     P ← list of non-crossing pairs
3:     P' ← list of crossing pairs
4:     L ← list of letters
5:     fix the beginning and end
6:     for each ab ∈ P do
7:         compress pair ab
8:     for each ab ∈ P' do
9:         compress pair ab
10:    for each a ∈ L do
11:        compress blocks of a
12: Output the answer.



Remark Notice that pair compression of $ab$ to $c$ in fact introduces a new nonterminal with a
production $c \to ab$; similarly, block compression for $a$ introduces new nonterminals with rules $a_\ell \to a^\ell$.
Hence, FCPM creates new SLPs for the text and the pattern. This justifies the name 'recompression' used for
the whole process.

Still, these new nonterminals are never expanded by FCPM and are always treated as individual
symbols; thus it is better to think of them as letters. In particular, the running time analysis of FCPM
uses the fact that no new nonterminals are ever introduced to $G$.

Major challenges 

Before we proceed to describing the details of FCPM, we would like to point out the main
problems we are dealing with. The non-crossing pair (and block) compressions are easy to implement
and are not an issue: it is enough to read $G$ and replace the appropriate explicit strings. When it comes
to a crossing pair compression, a simple transformation of the instance changes the crossing pair $ab$ into
a non-crossing one: whenever $a$ is to the left of $X_i$ and $\mathrm{val}(X_i) = bw$, we modify the productions for
$X_i$ so that $\mathrm{val}(X_i) = w$ and replace $X_i$ by $bX_i$ in every rule; a similar transformation is applied to the
nonterminals $X_j$ to the left of $b$ such that $\mathrm{val}(X_j) = w'a$. This makes $ab$ a non-crossing pair.

A similar approach works for crossing block compression; this time, though, we need to remove the $a$-prefix
($a$-suffix) from each nonterminal to the right (left, respectively) of $a$. This removes all crossing blocks
of $a$, so that its blocks can be compressed. Notice that this is all easy to perform, except that we may
introduce into $G$ explicit blocks of $a$ that have exponential length. These can be conveniently represented:
$a^\ell$ is simply denoted as $(a, \ell)$, with $\ell$ encoded in binary.

The ends of $\mathrm{val}(X_m)$ have to be treated in a special way: consider the pattern $abc$ and the text $aabccb$. When
$aa$ is replaced by $a_2$ and $cc$ by $c_2$, the obtained text $a_2 b c_2 b$ no longer contains the pattern, which is still
$abc$. This is fixed by enforcing that the leading pair of the pattern ($ab$ in this case) is compressed first.
The situation becomes more complicated when $\mathrm{val}(X_m)$ begins with an $\ell$-block of $a$; in this case we tune the block
compression a little.
Simplifications

Some simplifications and additional assumptions are made in this extended abstract in order
to increase readability; the full version of this paper has neither such simplifications nor additional
assumptions. Simplified statements: The lemmas are stated in a simplified way, omitting some of the
technical details, but highlighting the intuitively important properties. Size of code-word: We assume
that $N$ and $M$ fit in $O(1)$ code words. This allows explicit calculation of the lengths of $\mathrm{val}(X_m)$ and
$\mathrm{val}(X_{n+m})$. In the full version of this paper the same results are shown under the weaker assumption
that $n$ and $m$ fit in $O(1)$ code words.

3 Details 

In this section we describe the details of the implementation of Algorithm 1 and analyse its running time for
such an implementation.

Grammar 

The grammar kept by FCPM is closely related to the input one: 

(SLP) The set of used nonterminals is a subset of $X = \{X_1, \ldots, X_{n+m}\}$ and the productions are of the
form described in (1).

FCPM preserves (SLP); in particular, we always assume that the input of the subroutines satisfies (SLP).
We assume more for the input instance: we want it to obey the Chomsky normal form, instead of the
relaxed conditions (1).

Compression of non-crossing pairs 

We start by describing the compression of a non-crossing pair ab, as it is the easiest to explain. Intuitively, 
whenever ab appears in string encoded by G, the letters a and b cannot be split between nonterminals. 
Thus, it should be enough to replace their explicit appearances. 

Lemma 3. The non-crossing pairs compression can be performed in $O(|G|)$ time.

We read G and list all pairs' appearances and flag them, depending on whether these appearances 
are crossing or not. We then group these appearances by the pair, i.e. for a fixed pair we have a list of 
all appearances of this pair. For a fixed non-crossing pair ab, we go through the corresponding list of 
appearances and replace each explicit ab in G by a fresh letter c. 
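
For a single non-crossing pair, the replacement of explicit appearances can be sketched as follows. The representation is an assumption of this example (integers for nonterminals, strings for letters, one list per right-hand side), the names `compress_noncrossing_pair` and `expand` are hypothetical, and the sketch scans the whole grammar for one pair instead of using the grouped appearance lists described above.

```python
def compress_noncrossing_pair(rules, a, b, c):
    """Replace each explicit ab in the right-hand sides by the fresh letter c."""
    new_rules = {}
    for i, rhs in rules.items():
        out, k = [], 0
        while k < len(rhs):
            if k + 1 < len(rhs) and rhs[k] == a and rhs[k + 1] == b:
                out.append(c)
                k += 2
            else:
                out.append(rhs[k])
                k += 1
        new_rules[i] = out
    return new_rules


def expand(rules, i):
    """Explicit decompression of val(X_i), for checking small examples only."""
    out = []
    for s in rules[i]:
        out.extend(expand(rules, s) if isinstance(s, int) else [s])
    return out


# X1 -> ab, X2 -> X1 c X1, X3 -> b X2; the pair ab is non-crossing here.
rules = {1: ["a", "b"], 2: [1, "c", 1], 3: ["b", 2]}
after = compress_noncrossing_pair(rules, "a", "b", "d")
```

Because $ab$ is never split between a nonterminal and its neighbour in this grammar, rewriting the explicit appearances yields exactly the pair compression of the generated string $babcab$.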

Algorithm 2 LeftPop($X_i$)
1: let $X_i \to \alpha$ and let $b$ be $\alpha$'s first symbol
2: remove the leading $b$ from $\alpha$
3: replace each $X_i$ in the rules by $bX_i$



Compression of crossing pairs 

Let $ab$ be a crossing pair because $a$ is to the left of a nonterminal $X_i$ such that $\mathrm{first}[X_i] = b$. To remedy this
we 'pop' the leading $b$ from $X_i$: if $\mathrm{val}(X_i) = bw$ we modify $G$ so that $\mathrm{val}(X_i) = w$. This is implemented in
LeftPop. Such a procedure is applied to each nonterminal that is to the right of $a$. A symmetric procedure
is applied for a letter $b$ and nonterminals $X_i$ such that $b$ is to the right of $X_i$ and $\mathrm{last}[X_i] = a$.
When the pair $ab$ is no longer crossing, it can be compressed in the way described above.
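
The popping step can be sketched as below. This is an illustration only, under the following assumptions: nonterminals are integers, letters are strings, and the first symbol of the rule for $X_i$ is an explicit letter when the procedure is called. The function names are hypothetical, and the removal of a nonterminal whose value becomes empty is added here to respect condition (1c).

```python
def left_pop(rules, i):
    """Pop the leading letter b of X_i and put it in front of each use of X_i."""
    b = rules[i][0]  # assumed to be an explicit letter
    rules[i] = rules[i][1:]
    emptied = not rules[i]
    for j in list(rules):
        if j == i:
            continue
        new_rhs = []
        for s in rules[j]:
            if s == i:
                new_rhs.append(b)
                if not emptied:
                    new_rhs.append(i)
            else:
                new_rhs.append(s)
        rules[j] = new_rhs
    if emptied:
        del rules[i]  # val(X_i) is empty, so X_i is dropped entirely
    return b


def expand(rules, i):
    """Explicit decompression of val(X_i), for checking small examples only."""
    out = []
    for s in rules[i]:
        out.extend(expand(rules, s) if isinstance(s, int) else [s])
    return out


# X1 -> bc, X2 -> a X1 a X1; the pair ab is crossing, since first[X1] = b.
rules = {1: ["b", "c"], 2: ["a", 1, "a", 1]}
popped = left_pop(rules, 1)
```

After the call, every $ab$ in the example is explicit in the rule for $X_2$, while the generated string is unchanged.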

Lemma 4. The PairComp properly compresses a crossing pair ab. 



Algorithm 3 PairComp($ab$)
1: for i ← 1 .. m + n do
2:     if $a$ is to the left of $X_i$ and $\mathrm{first}[X_i] = b$ then
3:         LeftPop($X_i$)
4:     if $b$ is to the right of $X_i$ and $\mathrm{last}[X_i] = a$ then
5:         RightPop($X_i$)
6: compress the pair $ab$
Changing all crossing pairs to non-crossing ones can be done in parallel, similarly as in the case of
the compression of the non-crossing pairs. However, this can be done only under the assumption that these
pairs do not overlap, where $ab$ and $a'b'$ overlap if $a = b'$ or $b = a'$. The general case is obtained by
partitioning all crossing pairs into $2\log(n + m)$ groups, such that within each of the groups the pairs are
not overlapping. The partition is found by a simple greedy method, similar to the approximation of a vertex
cover.

Lemma 5. Pairs from $P'$ can be partitioned into $2\log(n + m)$ groups, such that performing PairComp
(with an appropriate implementation) for the pairs in one group takes $O(|G|)$ time.

Blocks compression 

Now, we turn our attention to the block compression. Suppose first that $G$ has no letters with a crossing
block. Then a procedure similar to the one compressing non-crossing pairs can be performed: when
reading $G$, we establish all maximal blocks of letters. We group these appearances according to the
letter, i.e. for each letter $a$ we create a list of $a$'s maximal blocks in $G$, and we sort this list according to
the lengths of the blocks. We go through such a list and replace each appearance of $a^\ell$ by a fresh letter $a_\ell$.

However, usually there are letters with crossing blocks. We deal with this similarly as in the case of
crossing pairs: a letter $a$ has a crossing block if and only if $aa$ is a crossing pair. So suppose that $a$ is to
the left of $X_i$ and $\mathrm{first}[X_i] = a$; in such a case we left-pop a letter from $X_i$. In general, this does not solve
the problem, as it may happen that still $\mathrm{first}[X_i] = a$. So we keep on left-popping until $\mathrm{first}[X_i] \neq a$. In
other words, we remove the $a$-prefix of $\mathrm{val}(X_i)$. A symmetric procedure is applied to $X_j$ such that $\mathrm{last}[X_j] = a$
and $X_j$ is to the left of $a$.

It turns out that even a simplified approach works: for each nonterminal $X_i$, where $\mathrm{first}[X_i] = a$ and
$\mathrm{last}[X_i] = b$, it is enough to 'pop' its $a$-prefix and $b$-suffix, see RemCrBlocks.

Observe that during the procedure, long blocks of $a$ (up to $2^{n+m}$) may be explicitly written in the
rules. This is conveniently represented: $a^\ell$ is simply denoted as $(a, \ell)$, with $\ell$ encoded in binary. When $\ell$
fits in one code word, the representation of $a^\ell$ is still of constant size and everything works smoothly.



Algorithm 4 RemCrBlocks: removing crossing blocks.
1: for i ← 1 .. m + n, except m and n + m do
2:     let $X_i \to \alpha_i$ be the production for $X_i$
3:     let $a = \mathrm{first}[X_i]$
4:     calculate and remove the $a$-prefix $a^{\ell_i}$ of $\alpha_i$
5:     let $b = \mathrm{last}[X_i]$
6:     calculate and remove the $b$-suffix $b^{r_i}$ of $\alpha_i$
7:     replace each $X_i$ in the rules' bodies by $a^{\ell_i} X_i b^{r_i}$
8:     if $\mathrm{val}(X_i) = \epsilon$ then
9:         remove $X_i$ from the rules' bodies
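
A possible reading of RemCrBlocks is sketched below. Several assumptions are made for the sake of a short example and are not taken from the paper: nonterminals are integers, letters are strings, rules refer only to smaller-numbered nonterminals, the nonterminals in `keep` (standing for the pattern and the text) are not referenced by the processed ones, and blocks are written out explicitly instead of using the binary-encoded $(a, \ell)$ representation. The names `rem_cr_blocks` and `expand` are hypothetical.

```python
def rem_cr_blocks(rules, keep):
    """Pop the a-prefix and b-suffix of every nonterminal outside `keep`."""
    for i in sorted(rules):
        if i in keep:
            continue
        rhs = rules[i]
        a = rhs[0]  # explicit letter: smaller nonterminals were already popped
        left = 0
        while left < len(rhs) and rhs[left] == a:
            left += 1
        rhs = rhs[left:]
        b, right = None, 0
        if rhs:
            b = rhs[-1]
            while right < len(rhs) and rhs[-1 - right] == b:
                right += 1
            rhs = rhs[:len(rhs) - right]
        rules[i] = rhs
        # Each use of X_i becomes a^left X_i b^right (without X_i if it emptied).
        patch = [a] * left + ([i] if rhs else []) + [b] * right
        for j in list(rules):
            if j > i:
                rules[j] = [t for s in rules[j]
                            for t in (patch if s == i else [s])]
        if not rhs:
            del rules[i]
    return rules


def expand(rules, i):
    """Explicit decompression of val(X_i), for checking small examples only."""
    out = []
    for s in rules[i]:
        out.extend(expand(rules, s) if isinstance(s, int) else [s])
    return out


# X1 -> aaba, X2 -> a X1 a X1; the letter a has crossing blocks around X1.
rules = {1: ["a", "a", "b", "a"], 2: ["a", 1, "a", 1]}
before = expand(rules, 2)
rem_cr_blocks(rules, keep={2})
```

In the example, all blocks of $a$ end up explicit in the rule for $X_2$, and the remaining value of $X_1$ neither starts nor ends with $a$.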



After RemCrBlocks , every letter a has no crossing blocks and we may compress maximal blocks using 
the already described method. 

Lemma 6. After the application of RemCrBlocks there are no crossing blocks. The time consumption of
RemCrBlocks and the following block compression is $O(|G|)$.

First and last letter of pattern 

We have to treat the 'ends' of the pattern in a careful way: consider a text $ababa$ and a pattern $bab$.
Then compression of $ab$ into $c$ results in a text $cca$ and a pattern $bc$, which no longer appears in the text.
The other problem appears during the block compression: consider the pattern $aab$ and the text $aaab$. Then
after the block compression the pattern is replaced with $a_2 b$ and the text with $a_3 b$.
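
The first example can be replayed on explicit strings. This is a small illustrative sketch; the helper names `compress_pair` and `occurs` are hypothetical, and words are modelled as Python lists of symbols.

```python
def compress_pair(word, a, b, c):
    """Replace every occurrence of the pair ab (a != b) by the letter c."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(c)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def occurs(pattern, text):
    """Naive check whether pattern is a contiguous sublist of text."""
    k = len(pattern)
    return any(text[i:i + k] == pattern for i in range(len(text) - k + 1))


text, pattern = list("ababa"), list("bab")
new_text = compress_pair(text, "a", "b", "c")
new_pattern = compress_pair(pattern, "a", "b", "c")
```

Before the compression the pattern appears in the text; afterwards the leading $b$ of the pattern has been absorbed into a letter $c$ in the text, so the appearance is lost.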

In general, the problems arise because the compression applied by FCPM is done partially on the
pattern appearance and partially outside it, so it cannot be reflected in the compression of the pattern.
We say that the compression spoils the pattern's beginning (end) when such partial compression appears
at the beginning (end, respectively) of a pattern appearance. In the working example, spoiling of the pattern's
beginning can be circumvented by enforcing a compression of the pair $ba$ in the first place: when the two
first letters of the pattern are replaced by a fresh letter $c$, then the beginning of the pattern can no longer
be spoiled in this phase (as $c$ will not be compressed in this phase). We say that the pattern's beginning
(end) is fixed by a pair or block compression if after this compression the first (last, respectively) letter of
the pattern is a fresh letter. Notice that the same compression can at the same time fix the beginning
and spoil the end: for instance, compressing $ba$ into $c$ does so in the working example. Our goal is to fix
both the beginning and the end, without spoiling either of them.

If the first two letters of the pattern are $ab$ for $a \neq b$, then we can fix the beginning by compressing the
pair $ab$ before any other pairs (or blocks) are compressed. This cannot be applied if $\mathrm{val}(X_m)$ has a leading
$\ell$-block of letters $a$, for $\ell > 1$. The problem is that each $m$-block for $m \ge \ell$ can begin an appearance of
the pattern in the text. This is circumvented by applying a tuned version of block compression: observe
that there is no apparent reason why we needed to replace $a^m$ by exactly one letter. Replacement by
several letters is fine, as long as they 'do not mix' in the output. To be more specific:

• each $m$-block of length $m \le \ell$ is replaced by a fresh letter $a_m$ and

• each $m$-block of length $m > \ell$ is replaced by a pair of letters $a_m a_\ell$.

For instance, for the text $aaaba$ and the pattern $aaba$ we obtain $a_3 a_2 b a_1$ as the new text and $a_2 b a_1$ as the new
pattern; clearly the pattern has an appearance in the text. In this way we fix the pattern's beginning.
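
The tuned replacement rule can be checked on this example with a few lines of code. The sketch works on explicit strings; the fresh letter $a_m$ is modelled by the hypothetical tuple `(a, m)`, and the names `tuned_block_compress` and `occurs` are not from the paper. The split between the two cases ($m \le \ell$ and $m > \ell$) follows the example above.

```python
from itertools import groupby


def tuned_block_compress(word, a, ell):
    """Blocks a^m become a_m if m <= ell, and the pair a_m a_ell if m > ell."""
    out = []
    for letter, run in groupby(word):
        m = len(list(run))
        if letter != a:
            out.extend([letter] * m)
        elif m <= ell:
            out.append((a, m))
        else:
            out.extend([(a, m), (a, ell)])
    return out


def occurs(pattern, text):
    """Naive check whether pattern is a contiguous sublist of text."""
    k = len(pattern)
    return any(text[i:i + k] == pattern for i in range(len(text) - k + 1))


# The pattern aaba has a leading 2-block of a, so ell = 2.
new_text = tuned_block_compress(list("aaaba"), "a", 2)
new_pattern = tuned_block_compress(list("aaba"), "a", 2)
```

Every block of length at least $\ell$ now ends with the same letter $a_\ell$, which is exactly the first letter of the compressed pattern.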

When $\mathrm{first}[X_m] \neq \mathrm{last}[X_m]$, fixing the beginning does not spoil the end and afterwards we simply
fix the end in a symmetrical way. When $\mathrm{first}[X_m] = \mathrm{last}[X_m]$ we need to apply a mixture of these two
techniques, but still both the beginning and the end can be fixed without prior spoiling. Roughly, when
$a^\ell$ and $a^r$ are the $a$-prefix and $a$-suffix (where, without loss of generality, $\ell \ge r \ge 1$), we first make the
block compression of $a$, in which $m$-blocks are replaced with:

• $a_m$ for $m < r$;

• $a_r a_m$ for $r < m < \ell$;

• $a_r a_\ell$ for $m = \ell$;

• $a_r a_m a_\ell$ for $m > \ell$.

Unfortunately, for some values of $\ell$, $r$ (for instance, take $\ell = r = 1$), this might actually enlarge the text
(in the example, $a$ is replaced by $a_r a_\ell$). However, by enforcing compression of pairs of the form $a_\ell b$ and
$b a_r$ for $b \in \Sigma$ and some simple tricks, the compression can be achieved.

Lemma 7. In $O(|G|)$ time we can fix both the beginning and the end without prior spoiling them.

Grammar and alphabet sizes

The subroutines of FCPM run in time dependent on $|G|$ and $|\Sigma|$, and so we bound these sizes.

Lemma 8. During FCPM, $|G| = O((n + m) \log(n + m))$ and $|\Sigma| = O((n + m) \log(n + m) \log M)$.

The proof is straightforward: using an argument similar to Lemma 2 we show that the size of each
rule shortens by a constant factor in each phase. On the other hand, only LeftPop, RightPop and
RemCrBlocks introduce new letters to the rules, and it can be estimated that in total they introduce
$O(\log(n + m))$ letters to a rule in each phase. Thus, the bound $O(\log(n + m))$ on each rule's length holds.
Concerning $|\Sigma|$, new letters appear as a result of a compression. Since each compression decreases
$|G|$ by at least 1, there are no more than $|G|$ of them in a phase, which yields the bound.

Memory Consumption 

FCPM uses $O((n + m) \log(n + m))$ space; the same holds if we want to retrieve the first/last positions etc.
of the pattern, under the assumption that $N$ and $M$ fit in $O(1)$ codewords. If only $n$ and $m$ fit in $O(1)$
codewords, the space consumption increases by a factor representing the length of the text and pattern, i.e.
$\log(N + M)$.

Sketch of the main proof 

The cost of one phase of FCPM is $O(|G| + (n + m) + (m + n) \log(n + m))$ by Lemmas 3-7, while
Lemma 8 shows that $|G| = O((n + m) \log(n + m))$ and Lemma 2 shows that there are $O(\log M)$ phases.
So the total running time is $O((n + m) \log M \log(n + m))$.

Appendix



A General comments and remarks 

Notational convention 

In order to distinguish between the input and output of procedures in the proofs, we use 'primed' symbols
(like $G'$, $X'_i$, $\alpha'_i$) for the parts of the output and 'unprimed' ones (so $G$, $X_i$, $\alpha_i$) for the parts of the input.

Sorting 

In the subroutines of FCPM we sort the data several times. Since the input can be identified with
relatively small numbers, we use RadixSort for that. To recall, RadixSort sorts strings of digits (we can
think of them as numbers without leading zeroes) according to the order:

• if $s$ is shorter than $s'$ then $s$ is smaller;

• if $s$ and $s'$ are of the same length and $s_k$ and $s'_k$ are their left-most differing positions, then $s$ is
smaller than $s'$ if and only if $s_k < s'_k$.

If the lengths of the strings are $\ell_1, \ell_2, \ldots, \ell_r$ and the alphabet of digits consists of $0, 1, \ldots, u^c$ for some
constant $c$, we can sort these strings in $O(c(\sum_i \ell_i + u))$ time. In all our considerations, $c$ is a constant
not larger than 3.
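The order used by RadixSort can be illustrated with a short sketch. It is not the implementation assumed in the analysis: the function name `radix_sort_shortlex` and the representation of strings as tuples of digits are assumptions made for this example, the lengths are ordered with a comparison sort for brevity, and the stated time bound is not reproduced.

```python
from collections import defaultdict

def radix_sort_shortlex(strings, base):
    """Sort digit strings (tuples of ints in range(base)): shorter strings
    first, equal-length strings by their left-most differing digit."""
    by_length = defaultdict(list)
    for s in strings:
        by_length[len(s)].append(tuple(s))
    result = []
    for length in sorted(by_length):  # comparison sort on lengths, for brevity
        group = by_length[length]
        # LSD radix sort: one stable bucket pass per digit, last digit first.
        for pos in range(length - 1, -1, -1):
            buckets = [[] for _ in range(base)]
            for s in group:
                buckets[s[pos]].append(s)
            group = [s for bucket in buckets for s in bucket]
        result.extend(group)
    return result
```

Each pass is stable, so after the pass on the first digit the strings of a given length are ordered exactly as in the second rule above.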

Compression 

We denote the string obtained from $w$ by block compression for $a$ by $BC_a(w)$, and by pair compression
of a pair $ab$ into $c$ by $PC_{ab \to c}(w)$.

For shortness, we say that some procedure realises pair compression for $a$, $b$, or realises $PC_{ab \to c}$, if
for each $i = 1, \ldots, n + m$ it holds that

$PC_{ab \to c}(\mathrm{val}(X_i)) = \mathrm{val}(X'_i)$,

where $X'_i$ denotes the nonterminal in the grammar after the application of the aforementioned procedure.
A similar notion is introduced for block compression.
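On explicit strings both operations are simple scans. The sketch below only illustrates the definitions; the function names, the list-of-letters representation and the dictionary `fresh` (mapping a block length to the letter that replaces it) are assumptions of this example, and blocks of length 1 are left untouched.

```python
def pair_compression(w, a, b, c):
    """PC_{ab->c}(w): replace every occurrence of the pair ab (a != b) by c."""
    assert a != b
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
            out.append(c)
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out

def block_compression(w, a, fresh):
    """BC_a(w): replace each maximal block a^l with l > 1 by the letter fresh[l]."""
    out, i = [], 0
    while i < len(w):
        j = i
        while j < len(w) and w[j] == w[i]:
            j += 1  # w[i:j] is a maximal block of the letter w[i]
        if w[i] == a and j - i > 1:
            out.append(fresh[j - i])
        else:
            out.extend(w[i:j])
        i = j
    return out
```

Because $a \neq b$, occurrences of $ab$ cannot overlap, so the left-to-right scan in `pair_compression` replaces all of them.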

Notice that if some procedure realises pair compression for a pair $ab$, where $a \neq \mathrm{last}[X_m]$ and
$b \neq \mathrm{first}[X_m]$ (or block compression for $a$ such that $\mathrm{first}[X_m] \neq a \neq \mathrm{last}[X_m]$), then the appearances of
$\mathrm{val}(X'_m)$ in $\mathrm{val}(X'_{n+m})$ are in 1-to-1 correspondence with the appearances of $\mathrm{val}(X_m)$ in $\mathrm{val}(X_{n+m})$. An
appropriate modification for the excluded variant is presented later.

There is a remark to be made: if we perform the block compression for a letter $a$ that does not appear in
$\mathrm{val}(X_m)$, then we could replace different blocks $a^\ell$ with different letters, and for the obtained grammar
$G'$ it would still hold that the appearances of $\mathrm{val}(X'_m)$ in $\mathrm{val}(X'_{n+m})$ are in 1-to-1 correspondence with
the appearances of $\mathrm{val}(X_m)$ in $\mathrm{val}(X_{n+m})$. Since this property is actually the one we aim at, we relax
the notion of realising the block compression: if $a$ does not appear in the pattern, we may replace blocks
$a^\ell$ in $\mathrm{val}(X_{n+m})$ with arbitrary letters. A similar relaxation is made for block compression for blocks $b^\ell$,
where $\ell > |\mathrm{val}(X_m)|$ and $b$ is neither the last nor the first letter of the pattern.

Furthermore, when fixing the beginning and end, we use a slightly tuned version of block compression,
in which a block may be replaced by several letters, not just by one. In such a case we still say that a
procedure realises the block compression with regard to this variant of the block compression.

Letter weights 

In some cases it is useful to consider the weight of letters. Weight is extended to strings in a natural 
way. Every letter a in the input grammar has w(a) = 1, while each new letter a that replaces the string 
w is assigned w(a) = w(w). We treat weights as a tool of analysis and usually do not calculate them. 
However, when $N$ fits in a constant number of code words, the weight of each letter can be calculated in
constant time.

No neighbouring nonterminals



In the design and analysis of the subprocedures it is convenient to assume that there are no nonterminals
next to each other in any rule, hence we do so. All described procedures can be extended so that they
also work in the general case; however, it is perhaps more instructive to show that every instance can
be transformed into an instance that

• is only constant times larger, 

• defines the same pattern and text. 

This is done by the following simple preprocessing algorithm. 



Algorithm 5 Preprocessing
for $i \leftarrow 1 \ldots n + m$, except $n$ and $n + m$ do
    let $X_i \to \alpha_i$ and $a = \mathrm{first}[X_i]$
    remove the leading $a$ from $\alpha_i$, replace each $X_i$ in the rules' bodies by $aX_i$
    $b = \mathrm{last}[X_i]$
    remove the ending $b$ from $\alpha_i$, replace each $X_i$ in the rules' bodies by $X_i b$
    if $\mathrm{val}(X_i) = \epsilon$ then
        remove $X_i$ from the right-hand sides of the rules



The analysis of this algorithm is straightforward and is done in a similar way as the analysis of LeftPop
in Lemma 5 and of RemCrBlocks in Lemma 6, so it is omitted.

A.1 Local definition of 'to the right' and crossing pairs

We say that $X_i$ is directly to the right of a letter $a$ if there is a rule $X_j \to \alpha_j$ such that $aX_i$ appears explicitly
in $\alpha_j$. A similar definition is given for a nonterminal directly to the left of a letter.

Lemma 9. $X_i$ is to the right of $a$ if and only if

• it is directly to the right of $a$, or

• there is a sequence of nonterminals $X_{i_1}, X_{i_2}, \ldots, X_{i_k} = X_i$ such that $X_{i_1}$ is directly to the right
of $a$ and for each $1 \leq j < k$, $X_{i_{j+1}}$ is the first symbol in the rule for $X_{i_j}$.

A symmetrical characterisation can be given for nonterminals that are to the left of a letter.

Proof. Straightforward. □ 
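The characterisation of Lemma 9 translates directly into a graph search. The sketch below assumes a hypothetical representation chosen only for this illustration: `rules[i]` is the body of the rule for the $i$-th nonterminal, letters are Python strings and nonterminals are integer indices into `rules`.

```python
def nonterminals_right_of(rules, a):
    """Indices of the nonterminals that are to the right of the letter a."""
    found = set()
    # Nonterminals directly to the right of a: explicit appearances of a X_i.
    stack = [y for body in rules for x, y in zip(body, body[1:])
             if x == a and isinstance(y, int)]
    while stack:
        i = stack.pop()
        if i in found:
            continue
        found.add(i)
        # If the rule for X_i starts with a nonterminal, it is to the right of a too.
        if rules[i] and isinstance(rules[i][0], int):
            stack.append(rules[i][0])
    return found
```

For instance, with `rules = [["b"], [0, "c"], ["a", 1], ["c", 0]]` the nonterminal with index 1 is directly to the right of `"a"`, and the nonterminal with index 0 is reached because it is the first symbol of the rule with index 1.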

Claim 1. If $ab$ is a crossing pair (we allow the possibility that $a = b$) then there exists a nonterminal $X_i$
with a rule $X_i \to \alpha_i$ such that

• the string $aX_k$ appears in $\alpha_i$, where $b = \mathrm{first}[X_k]$, or

• the string $X_k b$ appears in $\alpha_i$, where $a = \mathrm{last}[X_k]$.

Conversely, in any of the cases described above the pair $ab$ is crossing.

If $ab$ is a noncrossing pair then there is a production $X_i \to \alpha_i$ such that $ab$ is an explicit substring of
$\alpha_i$.

Proof. The converse implication follows directly from Lemma 9.

Concerning the main implication, let $ab$ be a crossing pair. By symmetry, we can consider only the
case in which there is a nonterminal $X_i$ to the right of $a$ such that $\mathrm{first}[X_i] = b$. By Lemma 9, either $X_i$
is directly to the right of $a$, which ends the proof, or there are nonterminals $X_{i_1}, X_{i_2}, \ldots, X_{i_k} = X_i$ such
that $X_{i_1}$ is directly to the right of $a$ and $X_{i_{j+1}}$ is the first symbol in the rule for $X_{i_j}$. However, the latter
condition shows that $\mathrm{first}[X_{i_1}] = \mathrm{first}[X_{i_2}] = \ldots = \mathrm{first}[X_{i_k}] = \mathrm{first}[X_i] = b$. Hence $\mathrm{first}[X_{i_1}] = b$,
and the explicit appearance of $aX_{i_1}$ is the one required by the claim.

Consider now a noncrossing pair $ab$ and a nonterminal $X_i$ such that $ab$ appears in $\mathrm{val}(X_i)$ but not in
$\mathrm{val}(X_j)$ for $j < i$. Then $ab$ appears explicitly in the rule for $X_i$. □

B Omitted proofs



Lemma 1. Consider a grammar in which blocks of a letter can be represented as a single symbol. There
are at most $2(n + m)$ different letters with crossing blocks, at most $4(n + m)$ different crossing pairs
and at most $|G|$ noncrossing pairs. For a letter $a$ there are at most $|G| + 4(n + m)$ different lengths of
$a$'s maximal blocks in $\mathrm{val}(X_1), \ldots, \mathrm{val}(X_{n+m})$.

The set of crossing (noncrossing) pairs can be calculated in $O(|G|)$ time.

Proof of Lemma 1. Observe that if $a$ has a crossing block then $\mathrm{first}[X_i] = a$ or $\mathrm{last}[X_i] = a$ for some
nonterminal $X_i$. Since there are $n + m$ nonterminals, there are at most $2(n + m)$ letters with crossing blocks.

By Claim 1, if $ab$ is a crossing pair then it can be associated with $aX_i$, where $\mathrm{first}[X_i] = b$, or with $X_i b$,
where $\mathrm{last}[X_i] = a$. Since there are at most $4(n + m)$ such substrings, the total number of crossing
pairs is at most $4(n + m)$. On the other hand, by the same claim, if $ab$ is a noncrossing pair then $ab$
appears explicitly in some of the rules of the grammar, and there are at most $|G|$ such substrings.

The argument for maximal blocks of $a$ is a little more involved. Consider first the maximal blocks that
have an explicit appearance in the rules of $G$; here the nonterminals also count as ending maximal blocks.
Then each letter (or block of letters that is represented as one symbol) is assigned to at most one
maximal block, and so there are no more than $|G|$ such blocks, hence no more than $|G|$ different lengths.
Assign the other blocks to nonterminals: a block $a^\ell$ is assigned to $X_i$ with a rule $X_i \to uX_j vX_k w$ if $a^\ell$ is a
maximal block in $\mathrm{val}(X_i)$ but not in $\mathrm{val}(X_j)$ nor in $\mathrm{val}(X_k)$. Thus there are four possibilities for
a block to be assigned to the rule:

• a letter $a$ from this maximal block is the last letter of $u$ and $a = \mathrm{first}[X_j]$,

• $a = \mathrm{last}[X_j]$ and a letter $a$ from this maximal block is the first letter of $v$,

• a letter $a$ from this maximal block is the last letter of $v$ and $a = \mathrm{first}[X_k]$,

• $a = \mathrm{last}[X_k]$ and a letter $a$ from this maximal block is the first letter of $w$.

Hence there are at most 4 maximal blocks assigned to $X_i$ in this way.

The $O(|G|)$ time algorithm for obtaining a list of crossing and noncrossing pairs is described in
Lemma 3 (noncrossing pairs) and Lemma 5 (crossing pairs). □

Lemma 2. There are $O(\log M)$ executions of the main loop of FCPM.

Proof of Lemma 2. We allow a slightly more general version of the compression: whenever a pair (or
block) is compressed, we can also replace a preceding (succeeding) letter $b$ with a fresh letter $b'$ and forbid
compressing any appearance of $b'$.

Consider any 4 consecutive letters $abcd$ appearing in $\mathrm{val}(X_m)$ at the beginning of the main loop
of FCPM. We show that at least one of these four letters is compressed before the next execution of this
loop. In this way, if we partition $\mathrm{val}(X_m)$ into blocks of 6 consecutive letters, each block is shortened by
at least one letter in each iteration of the main loop. Thus the length of $\mathrm{val}(X_m)$ decreases by a
factor of 5/6 in each iteration and so this loop is executed at most $O(\log M)$ times.
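The counting in the last step can be spelled out as follows. This is only an illustrative calculation: the symbol $M_k$, denoting the length of $\mathrm{val}(X_m)$ after $k$ iterations, is notation introduced here, and the rounding caused by a final incomplete block of fewer than 6 letters is ignored.

\[ M_0 = M, \qquad M_{k+1} \leq \tfrac{5}{6} M_k \quad \Longrightarrow \quad M_k \leq \left(\tfrac{5}{6}\right)^k M . \]

As $M_k \geq 1$ for a non-empty pattern, this gives $(6/5)^k \leq M$, i.e. $k \leq \log_{6/5} M = O(\log M)$.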

Claim 2. For any four consecutive letters appearing in some $\mathrm{val}(X_i)$ (or in the right-hand side of the production
for $X_i$) at the beginning of the phase, at least one of these letters is compressed in this phase.

Proof. We do not distinguish whether the four letters appear in a rule or in a defined string, as the
proof is the same.

Consider 4 consecutive letters $abcd$ and assume, for the sake of contradiction, that none of them is
compressed during this iteration of the loop.

Consider whether $b$ is forbidden to compress. If so, then either $a$ or $c$ was compressed,
which ends the proof. Similarly, if $c$ is forbidden to compress, then we are done. So assume that $b$ and $c$
are not forbidden to compress. The analysis splits depending on whether $b = c$ or not.

$b = c$: Then this pair of consecutive letters is going to be compressed by the block compression, contradiction.

$b \neq c$: Then the pair $bc$ is assigned to either $P$ or $P'$ and is going to be compressed by the corresponding
pair compression, depending on whether $bc$ is crossing or not. Contradiction. □

The proof of Lemma 2 follows from Claim 2. □

Lemma 3. The non-crossing pair compression can be performed in $O(|G|)$ time.
The compression of non-crossing pairs introduces at most $|G|$ new letters.

Proof of Lemma 3. We go through the list of productions of $G$. Whenever we spot an explicit pair $ab$, we
put $(a, b, 0, p)$ in the list of pairs' appearances, where 0 indicates that this appearance is non-crossing
and $p$ is the pointer to the appearance in $G$.

For each rule it is easy to list the crossing pairs associated with this rule (see Claim 1): when $aX_i b$
appears in some rule, we list the pairs $af$ and $\ell b$ with flags indicating that they are crossing, where $f$
($\ell$) is the first (last, respectively) letter in $\mathrm{val}(X_i)$.

Then we sort these tuples lexicographically, using RadixSort in $O(|G|)$ time: by Lemma 8 the size of
$\Sigma$ is polynomial in $n + m$, and RadixSort sorts the tuples in $O(|G| + n + m) = O(|G|)$ time. Thus, for
each pair we obtain a list of its appearances. Moreover, once the tuples are sorted, we can establish in $O(|G|)$ time
which pairs are crossing and which non-crossing.
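The listing and classification step can be sketched as follows. The representation is an assumption of this example (`rules[i]` is the body of the $i$-th rule and refers only to smaller indices, letters are strings, nonterminals are integers, no rule is empty and no two nonterminals are adjacent), Python's built-in sort stands in for RadixSort, pointers are omitted, and only pairs with $a \neq b$ are reported.

```python
from itertools import groupby

def first_last(rules):
    """first[i] and last[i]: the first and last letter of val(X_i)."""
    first, last = [], []
    for body in rules:
        f, l = body[0], body[-1]
        first.append(f if isinstance(f, str) else first[f])
        last.append(l if isinstance(l, str) else last[l])
    return first, last

def classify_pairs(rules):
    first, last = first_last(rules)
    records = []  # (a, b, flag): 0 = explicit appearance, 1 = crossing appearance
    for body in rules:
        for x, y in zip(body, body[1:]):
            if isinstance(x, str) and isinstance(y, str):
                records.append((x, y, 0))
            elif isinstance(x, str):      # a X_k with first[X_k] = b
                records.append((x, first[y], 1))
            elif isinstance(y, str):      # X_k b with last[X_k] = a
                records.append((last[x], y, 1))
    records.sort()  # stand-in for RadixSort; records of one pair become consecutive
    crossing, noncrossing = set(), set()
    for (a, b), group in groupby(records, key=lambda r: r[:2]):
        if a == b:
            continue  # blocks of a single letter are handled by block compression
        if any(flag == 1 for _, _, flag in group):
            crossing.add((a, b))
        else:
            noncrossing.add((a, b))
    return crossing, noncrossing
```

A pair is reported as crossing as soon as one of its appearances is crossing, which is why a single pass over the sorted records suffices.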

For a fixed non-crossing pair $ab$, we go through the associated list and use the pointers to locate and
replace all appearances of $ab$. If the pair is no longer there (as one of the letters $a$, $b$ was already replaced),
we do nothing. For a crossing pair, we do nothing.

If $G$ is organised as a list, the pointers can be manipulated in constant time, and so the whole
procedure takes $O(|G|)$ time.

Since each introduction of a new letter shortens the description of $G$ by at least 1, in total there are
at most $|G|$ new letters introduced.
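The effect of the replacement on the grammar can be sketched as below. This is a simplified illustration under assumed conventions (rule bodies are lists, letters are strings, nonterminals are integer indices of earlier rules): it rescans every rule instead of following the pointer lists, and the helper `expand` decompresses a nonterminal only to check the result on tiny examples.

```python
def compress_noncrossing_pair(rules, a, b, c):
    """Replace every explicit ab in the rule bodies by c.
    This realises PC_{ab->c} only when ab is noncrossing and a != b."""
    new_rules = []
    for body in rules:
        out, i = [], 0
        while i < len(body):
            if i + 1 < len(body) and body[i] == a and body[i + 1] == b:
                out.append(c)
                i += 2
            else:
                out.append(body[i])
                i += 1
        new_rules.append(out)
    return new_rules

def expand(rules, i):
    """val(X_i) as a list of letters (exponential in general, for checking only)."""
    out = []
    for s in rules[i]:
        out.extend([s] if isinstance(s, str) else expand(rules, s))
    return out
```

In the example `rules = [["a", "b"], ["c", 0, "a"], [1, "b", "c"]]` the pair `("b", "c")` is noncrossing, so replacing its single explicit appearance compresses every occurrence in the generated strings.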

It remains to show that the described actions are proper, i.e. that they realise $PC$.

Claim 3. The above description realises the $PC$ for non-crossing pairs.

Proof. Fix the order in which the pairs are considered by the procedure. We show that the
procedure realises the pair compression for each of those pairs. So consider a fixed pair $ab$. Recall that
in order to distinguish between the nonterminals before and after the compression of $ab$ we use 'primed'
nonterminals, i.e. $X'_i$, for the nonterminals after this compression and 'unprimed', i.e. $X_i$, for the ones
before.

We show by induction on $i$ that

$\mathrm{val}(X'_i) = PC_{ab \to c}(\mathrm{val}(X_i))$.

Indeed, this is true when the production for $X_i$ has no nonterminal on the right-hand side (recall the
assumption that $a \neq b$), as in this case each pair $ab$ on the right-hand side of the production for $X_i$ was listed
when the procedure read the bodies of the rules and there is a link to each such appearance of $ab$ (note
that there may be some links to appearances that ceased to exist, as one of their letters was already
compressed during the compression of some other pair). Then each such pair is replaced by $c$ and so
$\mathrm{val}(X'_i) = PC_{ab \to c}(\mathrm{val}(X_i))$. When $X_i \to uX_j vX_k w$, then

$\mathrm{val}(X_i) = u\,\mathrm{val}(X_j)\,v\,\mathrm{val}(X_k)\,w$ and

$\mathrm{val}(X'_i) = PC_{ab \to c}(u)\,\mathrm{val}(X'_j)\,PC_{ab \to c}(v)\,\mathrm{val}(X'_k)\,PC_{ab \to c}(w)$

$= PC_{ab \to c}(u)\,PC_{ab \to c}(\mathrm{val}(X_j))\,PC_{ab \to c}(v)\,PC_{ab \to c}(\mathrm{val}(X_k))\,PC_{ab \to c}(w)$,

with the last equality following by the induction assumption. Notice that since $ab$ is a non-crossing pair,
all occurrences of $ab$ in $\mathrm{val}(X_i)$ are contained in $u$, $v$, $w$, $\mathrm{val}(X_j)$ or $\mathrm{val}(X_k)$, as otherwise $ab$ is a crossing
pair, see Claim 1, which contradicts the assumption. Thus

$PC_{ab \to c}(\mathrm{val}(X_i)) = PC_{ab \to c}(u)\,PC_{ab \to c}(\mathrm{val}(X_j))\,PC_{ab \to c}(v)\,PC_{ab \to c}(\mathrm{val}(X_k))\,PC_{ab \to c}(w)$,

which shows that $PC_{ab \to c}(\mathrm{val}(X_i)) = \mathrm{val}(X'_i)$, ending the proof of the claim. □

□ 

Lemma 4. PairComp properly compresses a crossing pair $ab$.

Proof of Lemma 4. Since we already know from Lemma 3 that for noncrossing pairs the replacement
of each appearance of $ab$ in $G$ by $c$ properly realises the pair compression, it is enough to show that
PairComp makes $ab$ a noncrossing pair before compressing it.

Observe that LeftPop is not fully defined, as it always assumes that the rule for $X_i$ begins with a
letter. However, during the proof we show that this is indeed the case whenever it is invoked.

Let, as in the code, $b$ be the first symbol in the rule $X_i \to \alpha_i$. Observe that after LeftPop($X_i$) it holds
that

• $\mathrm{val}(X'_j) = \mathrm{val}(X_j)$ for $j \neq i$; furthermore, $X'_j$ generates the text (pattern) starting at position $p$ if
and only if $X_j$ does,

• $b\,\mathrm{val}(X'_i) = \mathrm{val}(X_i)$; furthermore, $X_i$ generated a string starting at position $p$ in the text (pattern)
if and only if $b$ is the letter at position $p$ in the text (pattern, respectively) and $X'_i$ generates the
text (pattern, respectively) starting from position $p + 1$.

In particular, $a$ is no longer to the left of $X'_i$.

Hence, after applying LeftPop to every nonterminal to the right of $a$ such that $\mathrm{first}[X_i] = b$, the
letter $a$ is not to the left of any such nonterminal. Observe also that LeftPop does not change the 'end'
of the word $\mathrm{val}(X_i)$, i.e. $X_i$ generates the string ending at position $p$ if and only if $X'_i$ does. A symmetric
analysis shows that after applying RightPop to every nonterminal $X_i$ to the left of $b$ such that $\mathrm{last}[X_i] = a$,
the letter $b$ is no longer to the right of such $X_i$. Consequently, $ab$ is no longer a crossing pair. □
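A single LeftPop step can be sketched as follows, under the same assumed representation as before (rule bodies are lists, letters are strings, nonterminals are integer indices). The function name is hypothetical, and the sketch rescans all rules rather than using pointers to the appearances of $X_i$.

```python
def left_pop(rules, i):
    """Remove the first letter b from the rule for X_i and replace every
    X_i in the rule bodies by b X_i. Returns a new list of rules."""
    b = rules[i][0]
    if not isinstance(b, str):
        raise ValueError("LeftPop assumes that the rule starts with a letter")
    new_rules = []
    for k, body in enumerate(rules):
        if k == i:
            body = body[1:]  # drop the leading b from the rule for X_i
        out = []
        for s in body:
            if isinstance(s, int) and s == i:
                out.append(b)  # the popped letter reappears before each X_i
            out.append(s)
        new_rules.append(out)
    return new_rules
```

With `rules = [["b", "a"], ["a", 0, "c"]]` the pair `ab` is crossing, since `"a"` is followed by a nonterminal whose value starts with `"b"`. After `left_pop(rules, 0)` the rules are `[["a"], ["a", "b", 0, "c"]]`, so `ab` now appears explicitly while the string generated by the second rule is unchanged.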



Algorithm 6 Left-popping a letter for $X_i$ to the right of $a$, DFS-like style

1: $L \leftarrow$ list of nonterminals directly to the right of $a$
2: for $X_i \in L$ do
3:     if $X_i$ is still directly to the right of $a$ then
4:         LeftPopDFS($X_i$)    ▷ Described below
5:
6: LeftPopDFS($X_i$)
7:     let $X_i \to \alpha_i$
8:     if $\alpha_i$ starts with a nonterminal then
9:         let $\alpha_i$ start with $X_j$
10:        LeftPopDFS($X_j$)
11:    let the first letter of $\alpha_i$ be $b$
12:    remove this $b$ from $\alpha_i$
13:    replace each $X_i$ in the rules by $bX_i$



Lemma 5. Pairs from $P'$ can be partitioned into $\log(n + m)$ groups, such that performing PairComp
(with an appropriate implementation) for the pairs in one group takes $O(|G|)$ time.
Processing one such group introduces at most 4 new letters to each rule.

Proof. We first describe what one such group looks like, then how to perform PairComp efficiently for
the pairs in one group, and lastly how to make the appropriate partition.

One group $P'' \subseteq P'$ is defined by a partition of the alphabet $\Sigma$ into two disjoint subalphabets $\Sigma_\ell$ and $\Sigma_r$.
Then we consider the pairs $P'' = \{ab \in P' : a \in \Sigma_\ell, b \in \Sigma_r\}$. Notice that in this way the pairs in $P''$
cannot 'overlap': if $ab \in P''$ then $a'a \notin P''$ for every possible $a' \in \Sigma$, and similarly $ba' \notin P''$. Intuitively,
this allows making the pair compressions for the pairs in $P''$ in parallel, in 'one go'.
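The 'one go' intuition can be seen on an explicit string. In the sketch below the function name and the arguments are assumptions of this illustration: `left` and `right` play the roles of $\Sigma_\ell$ and $\Sigma_r$, and `fresh` maps each pair of the group to the letter that replaces it.

```python
def compress_group(w, left, right, fresh):
    """Compress, in a single left-to-right pass, every pair ab with a in
    `left`, b in `right` and (a, b) a key of `fresh`."""
    assert not (set(left) & set(right)), "the two subalphabets must be disjoint"
    out, i = [], 0
    while i < len(w):
        if (i + 1 < len(w) and w[i] in left and w[i + 1] in right
                and (w[i], w[i + 1]) in fresh):
            out.append(fresh[(w[i], w[i + 1])])
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out
```

Since a letter of `right` can never start a pair of the group, two occurrences of such pairs never share a letter, and the result does not depend on the order in which the pairs are processed.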

There are two problems with the implementation of LeftPop: firstly, we do not know how to identify the
nonterminals that are to the right of $a$; secondly, LeftPop is undefined when we apply it to $X_i$ whose
rule begins with another nonterminal. Let us investigate this situation: let $X_i$ be a nonterminal to the
right of $a$ and let its rule be $X_i \to \alpha_i$. If $\alpha_i$ begins with a letter, then we can safely apply LeftPop to
$X_i$. However, when $\alpha_i$ begins with a nonterminal, say $X_j$, then $X_j$ is also to the right of $a$ (note that if
$X_i$ generates a string starting at position $p$ then so does $X_j$). Hence we should apply LeftPop to $X_j$
as well; furthermore, once LeftPop has processed $X_j$, the rule $X_i \to \alpha_i$ is changed so that $\alpha'_i$ does not
start with $X_j$, but with a letter instead. Furthermore, Lemma 9 asserts that if $X_j$ is to the right of $a$
then either it is directly to the right of $a$ or there is a nonterminal $X_i$ to the right of $a$ with a rule
$X_i \to \alpha_i$ such that $X_j$ is the first symbol in $\alpha_i$.

The discussion above shows that the loop of PairComp which pops a letter from each nonterminal
to the right of $a$ can be implemented in a DFS-like style, cf. Algorithm 6: we ensure that each $X_i$ directly
to the right of $a$ is processed; furthermore, if the rule $X_i \to \alpha_i$ of a processed $X_i$ begins with a nonterminal,
this nonterminal is processed before $X_i$. First observe that LeftPop is equivalent to lines 11-13 of
Algorithm 6. Now, to show the equivalence of the whole Algorithm 6 to this loop of PairComp,
it is enough to show that lines 11-13 of Algorithm 6 are performed exactly once per each nonterminal to
the right of $a$. It is obvious from the discussion above that they are performed at least once; we show that
they are performed at most once. Suppose, for the sake of contradiction, that LeftPopDFS was invoked
twice on some $X_i$, and consider the earliest such second invocation (over all $X_i$). There are two cases:
either the second call was made by a recursive call from some other invocation of LeftPopDFS, or because
$X_i$ was in the list $L$.

recursive call: The second invocation is made by a recursive call from LeftPopDFS($X_j$) for some other
nonterminal $X_j$. This means that the rule for $X_j$ begins with $X_i$. By the choice of $i$, this is the
first invocation of LeftPopDFS on $X_j$ and there was a previous invocation on $X_i$. However, the
first application of LeftPopDFS to $X_i$ replaced $X_i$ in the rule for $X_j$ by $bX_i$, for some letter $b$,
and this $b$ was not removed from the rule for $X_j$. Thus the rule for $X_j$ cannot start with $X_i$,
contradiction.

in the list: As in the previous case, since this is the second invocation, the first one replaced $X_i$ in all
rules by $bX_i$ for some $b$. Observe that since LeftPopDFS was applied to $X_i$, $ab \in P''$ and so $a \in \Sigma_\ell$
and $b \in \Sigma_r$. It was checked in line 3 that $X_i$ is to the right of $a$; however, $b$ is the only letter to
the left of $X_i$. Thus $a = b$, contradiction, as $\Sigma_\ell$ and $\Sigma_r$ are disjoint.

Hence Algorithm 6 is equivalent to the corresponding loop of PairComp. The other loop of PairComp, which
pops letters to the right, can of course be implemented in a symmetrical way. The running time will be
addressed later on. For now, we show a crucial property: during the processing of pairs from $P''$,
LeftPopDFS is applied to each $X_i$ at most once.

Claim 4. Consider several runs of LeftPopDFS for letters $\{a \mid ab \in P'' \text{ for some } b\}$ intertwined with
the compression of pairs in $P''$. Then LeftPopDFS is run on $X_i$ at most once in all these calls.

Proof. Suppose that it was run twice and consider its second invocation on $X_i$. The condition in
line 3 implies that right before the second application, $X_i$ is to the right of some $a \in \Sigma_\ell$. We show that
this is not the case.

After the first invocation of LeftPopDFS on $X_i$, $X_i$ was replaced in all rules by $bX_i$ for some $b$
such that $ab \in P''$ for some $a$. In particular $b \in \Sigma_r$. So right after the end of the first invocation on $X_i$,
the letter $b \in \Sigma_r$ is the unique letter to the left of $X_i$. Since $\Sigma_\ell$ and $\Sigma_r$ are disjoint, it is not from $\Sigma_\ell$.

There are two ways in which the letter to the left of $X_i$ can change: this letter is compressed or
LeftPop is applied to $X_i$. Since we consider the second invocation of LeftPop on $X_i$, by definition there
was no invocation between the first one and the second one. Concerning the other possibility: the newly
introduced letter is by definition not in $\Sigma_\ell \cup \Sigma_r$.

Hence a letter from $\Sigma_\ell$ cannot appear to the left of $X_i$, contradiction. □

Organisation of $P''$. We now describe how the appearances of pairs from $P''$ are represented. $P''$
itself is stored as a list. For a fixed pair $ab \in P''$ we have a list of pointers to all its appearances in $G$;
there are three types of such appearances:

explicit appearance: for an explicit appearance of $ab$ we keep a pair of pointers, to $a$ and to $b$;

crossing appearance with explicit $a$: for each appearance of $aX_j$ in $G$ such that $\mathrm{first}[X_j] = b$ we keep
a pair of pointers, one to $a$ and one to $X_j$;

crossing appearance with explicit $b$: for each appearance of $X_j b$ in $G$ such that $\mathrm{last}[X_j] = a$ we keep
a pair of pointers, as in the case above.

To perform the compression of $ab$ we simply go through the list of appearances of $ab$. When we have a
pointer to an explicit pair, we replace it by a new letter $c$ (provided both $a$ and $b$ are still there; they could have
been compressed as parts of some other pairs). When we encounter a pair of pointers to $a$ and $X_j$, we check whether they
are still next to each other.

are next to each other: We apply LeftPopDFS($X_j$), which replaces $X_j$ with $bX_j$. We then replace
the newly obtained pair $ab$ by $c$.

are no longer next to each other: This can happen only when LeftPopDFS has already been applied
to $X_j$. In such a case there is a $b$ between $a$ and $X_j$. We replace this pair $ab$ by $c$.

The actions for the third case, when the pointers are to $X_j b$, are similar to those in the previous case.

Replacement running time. We separately calculate the cost of the replacement of the pairs and the
cost caused by the runs of LeftPopDFS. Consider the former. Each replaced pair corresponds to one
appearance of $ab$ (or $aX_i$ or $X_i b$) in $G$. Hence the total cost of the replacement (summed over the $\log n$ lists
$P''$) is at most $O(|G|)$.

Concerning the cost of LeftPopDFS, we distribute the cost of the calls (including the recursive calls) as
follows:

• the cost of putting a nonterminal in the list $L$ in line 1 is charged to its appearance in $G$;

• the cost of the recursive call on $X_i$ is charged to $X_i$.

Note that the former cost is in total $O(|G|)$; the latter is at most $O(n + m)$, as both LeftPopDFS and
RightPopDFS are run on $X_i$ at most once, see Claim 4. Hence the total running time is $O(|G| + n + m)$,
as desired.

Creating $P'$ and $P''$. It remains to show that we can partition the pairs into $2 \log(n + m)$ groups; we
furthermore show that we can partition them in time $O(|G|)$. The groups are found by a greedy algorithm
GreedyPairs, see Algorithm 7. It resembles a simple approximation algorithm for vertex cover.



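Algorithm 7 GreedyPairs

1: $L \leftarrow$ set of letters used in $P'$
2: $k \leftarrow 1$
3: while $P' \neq \emptyset$ do
4:     $\Sigma_\ell \leftarrow \Sigma_r \leftarrow \emptyset$    ▷ Organised as a bit vector
5:     for $a \in L$ do
6:         if $|\{b : ab \in P' \text{ or } ba \in P', b \in \Sigma_r\}| > |\{b : ba \in P' \text{ or } ab \in P', b \in \Sigma_\ell\}|$ then
7:             ▷ Greedily choose the one that gives more pairs to $P''$
8:             $\Sigma_\ell \leftarrow \Sigma_\ell \cup \{a\}$
9:         else
10:            $\Sigma_r \leftarrow \Sigma_r \cup \{a\}$
11:    $P''[k] \leftarrow \{ab \mid a \in \Sigma_\ell, b \in \Sigma_r, ab \in P'\}$
12:    $P''[k + 1] \leftarrow \{ab \mid a \in \Sigma_r, b \in \Sigma_\ell, ab \in P'\}$
13:    $P' \leftarrow P' \setminus (P''[k] \cup P''[k + 1])$
14:    $k \leftarrow k + 2$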



Since in each iteration GreedyPairs makes a partition of $\Sigma$ into two disjoint sets, each of the returned
groups $P''[k]$ satisfies the required condition. Secondly, in each iteration the size of $P'$ halves: this follows
by a standard argument applied to the greedy choice made for each letter in the inner loop.
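The greedy partition can be sketched as follows. This is a quadratic-time illustration rather than the linear-time variant analysed in the text: the function name is hypothetical, pairs are Python tuples `(a, b)` with `a != b`, the letters are scanned in sorted order, and the comparison counts pairs rather than letters, which is an assumption of this example.

```python
def greedy_pairs(pairs):
    """Split `pairs` into groups, each contained in S_left x S_right for two
    disjoint sets of letters chosen greedily."""
    remaining = set(pairs)
    groups = []
    while remaining:
        s_left, s_right = set(), set()
        for a in sorted({x for pair in remaining for x in pair}):
            # Pairs joining a with letters already placed on either side.
            with_right = sum(1 for x, y in remaining
                             if (x == a and y in s_right) or (y == a and x in s_right))
            with_left = sum(1 for x, y in remaining
                            if (x == a and y in s_left) or (y == a and x in s_left))
            if with_right > with_left:
                s_left.add(a)   # a is separated from more of its placed neighbours
            else:
                s_right.add(a)
        forward = {(x, y) for x, y in remaining if x in s_left and y in s_right}
        backward = {(x, y) for x, y in remaining if x in s_right and y in s_left}
        groups.extend(g for g in (forward, backward) if g)
        remaining -= forward | backward
    return groups
```

Every pair is counted once, when its second letter is placed, and the chosen side separates at least half of the pairs counted at that moment; hence at least half of the remaining pairs leave `remaining` in each round, which also guarantees termination.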

It remains to analyse the running time of an efficient variant of GreedyPairs. The only non-obvious operations
are the comparison in the if-condition and the creation of $P''[k]$ and $P''[k + 1]$; all other
operations clearly take at most $O(|P'|)$ time.

The list $P'$ is organised as follows: for each letter $a$ that appears in some pair in $P'$ we have a right
list $\mathrm{right}(a) = \{b \mid ab \in P'\}$, represented simply as a list. Furthermore, the element $b$ on the right list has a
link to a list of all appearances of the pair $ab$ in the rules of $G$ (this list was already defined when $P''$
was described); there is a similar left list $\mathrm{left}(a) = \{b \mid ba \in P'\}$. We comment on how to create the left lists
and right lists later. The right lists and left lists are kept up to date, i.e. after creating $P''$ we remove the
corresponding elements from the right lists and left lists. The list $P''$ is organised in the same way as $P'$, except
that it has only the right lists, to avoid having duplicates.

Given $\mathrm{right}(a)$ and $\mathrm{left}(a)$, performing the check in the if-condition is easy: we are to compare the numbers
$|\{b \mid ab \in P', b \in \Sigma_r\}|$ and $|\{b \mid ba \in P', b \in \Sigma_\ell\}|$, i.e. calculate the size of the intersection of $\mathrm{right}(a)$
with $\Sigma_r$ and the size of the intersection of $\mathrm{left}(a)$ with $\Sigma_\ell$. To this end it is enough to read the $\mathrm{right}(a)$
($\mathrm{left}(a)$) list and check for its elements whether they are in $\Sigma_r$ ($\Sigma_\ell$, respectively). When $\Sigma_\ell$ and $\Sigma_r$ are
organised as bit vectors, this is done in time $O(|\mathrm{right}(a)| + |\mathrm{left}(a)|)$.

When the partition of $\Sigma$ into $\Sigma_\ell$ and $\Sigma_r$ is finished, we want to create $P''[k]$ and $P''[k + 1]$ and update $P'$.
This is also easy; we describe only the actions for $P''[k]$, the ones for $P''[k + 1]$ are similar. We read $P'$
again. When reading $b$ from $\mathrm{right}(a)$:

• if $a \in \Sigma_\ell$ and $b \in \Sigma_r$, we remove $b$ from $\mathrm{right}(a)$ in $P'$ and add it to $\mathrm{right}(a)$ in $P''[k]$, together with
the links to the appearances of $ab$;

• in other cases we do nothing.

When reading $b$ from the left list of $a$:

• if $b \in \Sigma_\ell$ and $a \in \Sigma_r$, we remove $b$ from $\mathrm{left}(a)$;

• in other cases we do nothing.

We add the appearances only to $\mathrm{right}(a)$, according to the definition of $P''$; in this way we avoid
duplication. Again, these operations can be done in time $O(|\mathrm{right}(a)| + |\mathrm{left}(a)|)$.

Since $\sum_{a \in \Sigma} (|\mathrm{left}(a)| + |\mathrm{right}(a)|) = 2|P'|$, we obtain that the creation of $P''$ and the update of $P'$ take
$O(|P'|)$ time. Since the number of pairs in $P'$ halves in each iteration, this yields a total running time of
$O(n + m)$, as initially $P'$ has $O(n + m)$ elements, see Lemma 1.

It remains to show how to initially create $P'$. This is done similarly to creating the list of noncrossing
pairs in Lemma 3; we sketch it for completeness. To this end we read $G$. When reading a pair $ab$ we create
a record $(a, b, 1, p)$, where $p$ is a pair of pointers to this appearance and 1 denotes that this appearance of
$ab$ is noncrossing. When reading $aX_j$, where $\mathrm{first}[X_j] = b$, we create a record $(a, b, 0, p)$ as before, where
0 denotes that this pair is crossing; we create a similar record for $X_j b$, where $\mathrm{last}[X_j] = a$. We then sort
these records lexicographically using RadixSort. As $|\Sigma| = O((n + m) \log(n + m) \log M)$ by Lemma 8 and
the total number of pairs is at most $O(|G| + 4(n + m))$ by Lemma 1, this can be done in time $O(|G|)$.
Now, for fixed letters $a$, $b$, the consecutive tuples whose first two coordinates equal $a$ and $b$ correspond
to all appearances of $ab$ in $G$. Reading them, we can find out whether $ab$ is crossing or not. If it is, we add $b$ to
$\mathrm{right}(a)$ and attach these elements as the list of appearances of $ab$ in $G$. The left lists are created in the
same way. □

Lemma 6. After RemCrBlocks there are no crossing blocks. This algorithm and the following block
compression can be performed in time $O(|G|)$ and introduce at most 4 new letters to each rule.

Proof. We first show the first claim of the lemma, i.e. that after RemCrBlocks there are no letters with
crossing blocks. This follows from two observations:

1. when RemCrBlocks considers $X_i$ then for $j \neq i$ the $\mathrm{val}(X_j)$ does not change, i.e. $\mathrm{val}(X'_j) = \mathrm{val}(X_j)$.
Furthermore, $X'_j$ generates a string starting (ending) at position $p$ if and only if $X_j$ does.

2. after RemCrBlocks considered $X_i$, if $X_i$ is to the right (left) of $a$ then $a \neq \mathrm{first}[X_i]$ ($a \neq \mathrm{last}[X_i]$,
respectively).

Both properties follow by a simple induction on the number $i$ of the considered nonterminal.

We infer from these observations that after RemCrBlocks there are no crossing blocks in $G$. Suppose,
for the sake of contradiction, that there are; let $a$ be a letter that has a crossing block. By symmetry
we consider only the case when there is $X_i$ such that $a = \mathrm{first}[X_i]$ and $X_i$ is to the right of $a$. Let $p$ be
the position such that $a$ is on this position and $X_i$ generates the string starting at position $p + 1$. By
observation 1, during the whole run of RemCrBlocks $a$ is on position $p$. Consider the $i$-th iteration
of RemCrBlocks, when $X_i$ is considered. Then in all iterations after the $i$-th one $X_i$ generates the string
starting at position $p + 1$, by observation 1. Furthermore, by observation 2, right after the $i$-th
iteration $X_i$ is not to the right of the letter $\mathrm{first}[X_i]$, contradiction.

Concerning the claims on the running time, we shall first analyse a simplified version of the algorithms,
in which

• we assume that all lengths of blocks considered in RemCrBlocks fit in one code word; in particular,
operating on these lengths takes constant time;

• we give a slightly larger time bound $O(|G| \log |G|)$, which is due to sorting.

We then show that using a more refined approach we can reduce the compression time to a linear one (by
using RadixSort) and also extend the result to blocks of larger length.

Observe that in this simplified scenario RemCrBlocks is performed in $O(|G|)$ time: assuming that
we represent a block $a^\ell$ as a pair $(a, \ell)$, the length of the $a$-prefix ($b$-suffix) is calculated simply by reading
the rule until a symbol other than $\mathrm{first}[X_i]$ ($\mathrm{last}[X_i]$, respectively) is found. Since there are at most 4
symbols introduced by RemCrBlocks to the rule, this takes at most $O(|G|)$ time. The replacement of $X_i$
by $a^{\ell_i} X_i b^{r_i}$ is done at most twice inside one rule and so takes in total $O(n + m)$ time.

Note that right after RemCrBlocks there might be neighbouring blocks of the same letter
in the rules of $G$. However, we can easily replace such neighbouring blocks by one block of appropriate
length in one reading of $G$, in time $O(|G|)$.

Concerning the compression of the blocks of letters, it is done in a similar way as the compression
of non-crossing pairs, see Lemma 3. For the sake of completeness, we present a sketch. We read the
description of $G$. Whenever we spot a maximal block $a^\ell$ for some letter $a$, we add a triple $(a, \ell, p)$ to the
list, where $p$ is the pointer to this appearance of the block in $G$. Notice that here the nonterminals
count as terminating maximal blocks.

After reading the whole G we sort these pairs lexicographically, in time 0(|G| log |G|). We go through 
the sorted list, for a fixed inner letter a, we use the pointers to localise a's blocks in the rules and we 
replace each of its maximal block of length £ > 1 by a fresh letter, i.e. number |E| + 1. Since the blocks 
of a are sorted according to their length, all blocks of the same length are consecutive on the list, and 
replacing them by the same letter is easily done. 
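
The sorting-based procedure just sketched can be illustrated in a few lines of Python. This is only a minimal sketch under assumptions that are not in the paper: rule bodies are encoded as lists, letters are integers, nonterminals are strings, the name compress_blocks is illustrative, and Python's built-in sort stands in for the O(|G| log |G|) sorting step.

```python
from itertools import groupby

def compress_blocks(rules, next_letter):
    """Replace every maximal block a^l (l > 1) of a letter by a fresh letter.

    rules: list of rule bodies; letters are ints, nonterminals are strings
    (a hypothetical encoding). next_letter: first unused letter number.
    """
    # Collect triples (a, l, pointer); a pointer is (rule index, position).
    triples = []
    for r, body in enumerate(rules):
        pos = 0
        for sym, run in groupby(body):
            length = len(list(run))
            if isinstance(sym, int) and length > 1:
                triples.append((sym, length, (r, pos)))
            pos += length
    triples.sort()  # lexicographic: equal (a, l) become consecutive

    # Walk the sorted list; each new (a, l) gets a fresh letter.
    fresh_at = {}
    prev = None
    for a, length, pointer in triples:
        if (a, length) != prev:
            prev = (a, length)
            letter = next_letter
            next_letter += 1
        fresh_at[pointer] = letter

    # Rebuild the rules, substituting the fresh letters at the pointers.
    new_rules = []
    for r, body in enumerate(rules):
        new_body, pos = [], 0
        for sym, run in groupby(body):
            length = len(list(run))
            if (r, pos) in fresh_at:
                new_body.append(fresh_at[(r, pos)])
            else:
                new_body.extend([sym] * length)
            pos += length
        new_rules.append(new_body)
    return new_rules, next_letter
```

Since groupby splits runs at every change of symbol, a nonterminal between two runs of the same letter keeps them as separate maximal blocks, as required.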

Since we already know that there are no letters with crossing blocks, we can show, as in Lemma [3],
that this procedure realises the block compression. The simple proof, which is essentially the same as the
proof of Lemma [3], is omitted.

General case 

As already noted, we should improve the algorithm above so that it takes into account two additional
factors:

• the lengths of the a blocks can be exponential in n + m and so we cannot assume that we can
operate on them in constant time;

• the O(|G| log |G|) sorting time should be replaced by O(|G|), and this should hold also for the large
numbers introduced above.

We deal with both problems at the same time, by introducing a special representation of the lengths of
a blocks.

However, before giving this representation, observe that we simply cannot consider blocks of letters
that are longer than M: in such a case, even creating the representation of their length takes more time
than log M and it cannot be amortised in any way. However, such blocks cannot appear in the pattern,
and so we may disregard them, as explained in detail below.

Letters not appearing in the pattern and blocks longer than M 

Consider a letter a that does not appear in val(X_m). Then there is no guarantee that if a^ℓ appears
in val(X_{n+m}) then ℓ ≤ M. On the other hand, since a does not appear in the pattern at all, there is no
difference whether we replace two appearances of a^ℓ with the same letter or with different letters. In
fact, we do not even need to replace a^ℓ by a single letter; we may replace it by arbitrary strings in
different places in G (though we do want the new representation to be shorter than the previous one, so
that Lemma [2] works even for such 'sloppy' compression).

Thus, for a that does not appear in val(X_m) we perform a 'sloppy' blocks compression: we treat each
maximal block as if it had a unique length. To be precise: we perform RemCrBlocks, but represent a
blocks as (a, ?) for ℓ > 1. Then, when replacing blocks of a (we exclude the blocks of length 1), we replace
each of them with a fresh letter. In this way, the whole blocks compression does not include any cost
of calculating the lengths of blocks of a. Nevertheless, such a modified procedure still properly realises
block compression.

Consider now a that appears in val(X_m), but has blocks of length greater than M in some other
val(X_i). We treat them similarly: as soon as we realise that a^ℓ has ℓ > M, we represent the block as
(a, > M) and do not calculate the exact length. The only difference is that we note that they are longer
than M.

Representation of blocks 

The intuition is as follows: while the a blocks can have exponential length, most of them do not differ
much, as in most cases the new blocks are obtained by concatenating letters a that appear explicitly in
the grammar. In this way, the length of the blocks can increase in total by |G|. Still, there are blocks
of exponential length: these 'long' blocks are created only when two blocks coming from two different
nonterminals are concatenated. However, there are only n + m concatenations of nonterminals, and
so the total number of 'long' blocks 'should be' at most n + m. Of course, the two mentioned ways of
obtaining blocks can mix, and our representation must take this into account: we represent each
block as a concatenation of two blocks, a 'long' one and a 'short' one:

• the 'long' one corresponds to a block obtained as a concatenation of two nonterminals; such a long
block is common to many blocks of letters,

• the 'short' one corresponds to concatenations of letters appearing explicitly in G; its length is
associated with the given block alone.

More formally: our implementation stores a list of powers, i.e. the lengths of common long blocks of
letters. These numbers can be exponential in n + m and so they are stored as bit vectors. In particular,
we assume that the operations on a power p, including basic arithmetic operations, take O(log p) time.

Then each length ℓ of the a-prefix a^ℓ is represented as a sum p + o, where p is one of the powers
and o (the offset) is a number associated with ℓ. We show that each offset is at most |G|; in particular, the
operations on it take linear time and it is stored in O(1) codewords. Internally, a^ℓ is represented as a
number o and a pointer to p. Furthermore, some blocks are represented only by offsets; we sometimes
think of them as if they were represented by a power and an offset.

We discuss when the new blocks are represented as new powers and when as old powers with larger
offsets. Initially a power p is created for each nonterminal X_i such that val(X_i) = a^p for some letter
a; we then store p. Next, other powers are created when we add two powers, i.e. when during the
calculation of a length we add lengths that are both represented using powers. If we concatenate an explicit
letter a to a block, we simply increase the offset. The blocks that are created solely by explicit letters a
are represented by offsets alone, without a power.
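
A minimal Python sketch of this bookkeeping follows. All names (BlockLength, LengthTable and their methods) are hypothetical, Python's arbitrary-precision integers stand in for the bit vectors, and the sketch only illustrates when a new power is created and when only an offset grows; it is not meant as the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BlockLength:
    power: Optional[int]  # index into the shared list of powers, or None
    offset: int           # small number owned by this block alone

class LengthTable:
    def __init__(self) -> None:
        self.powers: List[int] = []  # Python ints play the role of bit vectors

    def new_power(self, p: int) -> BlockLength:
        # E.g. for a nonterminal whose value is a^p.
        self.powers.append(p)
        return BlockLength(len(self.powers) - 1, 0)

    def value(self, x: BlockLength) -> int:
        base = self.powers[x.power] if x.power is not None else 0
        return base + x.offset

    def add_explicit_letters(self, x: BlockLength, count: int = 1) -> BlockLength:
        # Concatenating explicit letters only increases the offset.
        return BlockLength(x.power, x.offset + count)

    def concat(self, x: BlockLength, y: BlockLength) -> BlockLength:
        if x.power is not None and y.power is not None:
            # Both lengths use powers: their sum becomes a new power.
            return self.new_power(self.value(x) + self.value(y))
        # At most one power is involved: keep it and add the offsets.
        power = x.power if x.power is not None else y.power
        return BlockLength(power, x.offset + y.offset)
```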

Before proceeding, let us note how large the offsets may be and how many of them there are.

Claim 5. In total there are at most |G| offsets and the largest of them is at most |G|.

Proof. Observe that an offset is created when, for some letter a, an explicit letter a (not in a compressed
form) is concatenated to a block of a's. One letter is used once for this purpose, so there are at most
|G| offsets in total. By the same argument it follows that the largest offset is at most |G|. □

Since we intend to sort the lengths, it would be good to be able to compare two lengths
represented as powers with offsets, say o + p and o' + p'. Since the powers are so large, we expect that
we can compare them lexicographically, i.e.

\[ o + p \ge o' + p' \iff p > p' \;\lor\; (p = p' \land o \ge o'). \tag{1} \]

Furthermore, (1) allows a simple way of sorting the lengths of maximal blocks:

• we first sort the powers (by their values),

• then for each power we (separately) sort the offsets assigned to this power.

Unfortunately, (1) need not be true: even though p and p' are so large, it can still happen that |p − p'|
is small, as they were obtained in different ways. Still, after sorting the powers it is easy to improve
the representation so that (1) holds. To this end we need to perform only two simple operations. Let
p_1 < p_2 < ··· be the sorted list of powers. Then we want to

reassign offsets: if o + p_i ≥ p_{i+1} then we represent o + p_i using p_{i+1} instead, i.e. as (o − (p_{i+1} − p_i)) + p_{i+1}.

After performing this operation exhaustively, (1) holds.

Reassigning offsets is easy to do in a bottom-up fashion; however, one offset can be reassigned several
times (if many powers are close to each other). To avoid repeated reassignment, we first modify the list of
powers so that no two powers are close to each other:

remove close powers: if δ = p_{i+1} − p_i is small, i.e. at most |G|, then we remove p_{i+1}; consequently,
we replace each o + p_{i+1} by (o + δ) + p_i.

As a first step in removing the close powers, we sort them. 
Claim 6. Let p_1 < p_2 < ··· < p_k be the powers that are at most M. The time needed to sort them is

\[ O\Big(\sum_{i=1}^{k} \log(p_i) + |G|\Big). \tag{2} \]

This is done by a standard implementation of RadixSort that sorts numbers of different lengths;
the running time is O(k + 2 + Σ_{i=1}^{k} log(p_i)), which is as promised. It remains to add the powers that are
larger than M: we append them at the end of the list in an arbitrary order.

Define Δ_i = p_{i+1} − p_i. For implementation reasons, we trim it: if Δ_i ≤ |G|, we store it explicitly; if
Δ_i > |G| we simply represent it as +∞. Similarly, when p_{i+1} > M, we represent Δ_i as +∞. Then, given
a sorted list of the p_i's, we can calculate such trimmed values Δ_1, Δ_2, ... in O(Σ_{i=1}^{k} log(p_i) + |G|) time.
This is done by naive subtraction of p_i from p_{i+1} and then by trimming. Observe that such a subtraction
takes at most O(log p_{i+1}) time. Now it is easy to 'remove the close powers', see the code of FixPowers.
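
The trimming step can be sketched as below. The function name trimmed_deltas and its parameters are illustrative assumptions; g plays the role of |G|, and math.inf models the value +∞.

```python
import math

def trimmed_deltas(powers, g, M):
    """powers: sorted list of ints. Returns the trimmed differences, where the
    entry at index i is powers[i+1] - powers[i], or +inf when that difference
    exceeds g or when powers[i+1] exceeds M."""
    deltas = []
    for p, q in zip(powers, powers[1:]):
        d = q - p
        deltas.append(math.inf if (d > g or q > M) else d)
    return deltas
```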



Algorithm 8 FixPowers: remove the close powers
let p_1 < p_2 < ··· < p_k be the sorted list of powers, add p_0 = 0, p_{k+1} = ∞    ▷ For simplicity
i ← 1
while i < k + 1 do
    Δ ← 0, j ← i
    while Δ + Δ_j ≤ |G| do    ▷ If Δ_j = +∞ then this clearly fails
        Δ ← Δ + Δ_j, j ← j + 1
        for offset o assigned to p_j, representing ℓ do
            o ← o + Δ
            assign o to p_i, represent ℓ as o + p_i
        remove p_j
    Δ_i ← Δ + Δ_j, i ← j + 1



It is easy to see that FixPowers has the following properties:

• the value of the representation of ℓ does not change,

• after FixPowers every two powers differ by more than |G|,

• each offset is increased by at most |G|, so after FixPowers the maximal offset is at most 2|G|.

Now reassigning the offsets is also easy, see FixOffsets.



Algorithm 9 FixOffsets: reassigning offsets
let p_1 < p_2 < ··· < p_k be the sorted list of powers    ▷ Assume that Δ_i > |G|
for i ← 1 .. k − 1 do
    for offset o assigned to p_i, representing ℓ do
        if o ≥ Δ_i then
            o ← o − Δ_i
            assign o to p_{i+1}, represent ℓ as o + p_{i+1}



It is easy to see that

• as each offset is at most 2|G| and each Δ_i > |G|, each offset is reassigned at most once,

• the numerical value of the representation of ℓ does not change,

• if o is assigned to p_i then o + p_i < p_{i+1}.

The last property shows that after FixOffsets (1) holds. Furthermore, the cost is constant per offset
and so at most O(|G|) in total. In order to sort all lengths it is enough to sort the offsets within groups. To save
time, we do it simultaneously for all groups: an offset o_j assigned to power p_i is represented as (i, o_j), and we
sort these pairs lexicographically, using RadixSort. Since the offsets are at most 2|G| and there are at
most |G| of them, RadixSort sorts them in O(|G|) time.
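
The combined effect of FixPowers and FixOffsets can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's implementation: the function name normalise_lengths is hypothetical, powers are plain Python integers instead of bit vectors, exact differences are used instead of the trimmed values, and g stands for |G|.

```python
def normalise_lengths(powers, lengths, g):
    """powers: sorted distinct ints. lengths: list of (i, o) meaning
    powers[i] + o, with 0 <= o <= g. Returns (kept_powers, new_lengths);
    values are preserved and comparing the new pairs lexicographically
    compares the represented values."""
    # FixPowers: merge every power within distance g of the current
    # representative; remember how much its offsets must grow.
    kept, remap = [], []
    for p in powers:
        if kept and p - kept[-1] <= g:
            remap.append((len(kept) - 1, p - kept[-1]))
        else:
            kept.append(p)
            remap.append((len(kept) - 1, 0))
    # FixOffsets: move an offset to the next power if it reaches it.
    new_lengths = []
    for i, o in lengths:
        k, shift = remap[i]
        o += shift  # now o <= 2g
        if k + 1 < len(kept) and o >= kept[k + 1] - kept[k]:
            o -= kept[k + 1] - kept[k]
            k += 1
        new_lengths.append((k, o))
    return kept, new_lengths
```

After this normalisation, sorting the pairs (k, o) lexicographically, for instance with sorted(new_lengths) or with a radix sort, orders the block lengths by value.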

Running time 

Let p_1, p_2, ..., p_k be all powers created during RemCrBlocks. Let us recall the costs of dealing with these
powers:

• the creation of the power p_i takes time at most log(min(p_i, M)), as we stop the calculation of the
exact value of p_i as soon as its representation as a bit vector is longer than log M + 1;

• the sorting takes time linear in Σ_{i=1}^{k} log(min(p_i, M));

• all other operations take time proportional to |G|.

The O(|G|) term is fine: it matches our upper bound on the cost of RemCrBlocks, and so we can charge
it against that cost. The O(Σ_{i=1}^{k} log(min(p_i, M))) term is divided between the powers: we associate a cost
(linearly proportional to) log(min(p_i, M)) with p_i and then redirect this cost towards the rule in which p_i was
created.

Claim 7. For a single rule, the cost redirected from powers towards this rule during the whole run of
FCPM is O(log M).

Proof. Let us first consider the cost of dealing with powers that are larger than M and with powers for letters
not appearing in the pattern. For the latter, we do not calculate the lengths of powers at all, and so they
do not bring any cost to the current analysis. The former, however, introduce some cost: if p > M then
finding this out might take up to log M time. To be more precise, let p be
obtained as a sum of p' + o' and p'' + o''. If one of p', p'' is larger than M then we already know it and
in constant time mark that p is too large. However, if p', p'' ≤ M then the cost of the calculation can be
recharged to the larger of p', p'': one of them is at least p/2 − |G| and so this increases the assigned cost
by a constant factor. Hence, we can consider only the cost of dealing with powers that are at most M.

The creation of a power can remove a nonterminal from this rule, which happens when X_j in the
rule defines a string in a*. Then the cost is at most log M and such a cost can be charged at most twice to a rule,
as there are two nonterminals in the rule and there is no way to introduce a new nonterminal to a rule.

It can happen that no nonterminal is removed from the rule, even though a new power is created:
this happens only when in X_i's rule X_i → u X_j v X_k w both the a-suffix of val(X_j) and the a-prefix of
val(X_k) are represented using powers of a and, moreover, v ∈ a*.

Consider all such calculations performed on this rule. Recall that we already accounted for the cost of
block compression of the letters not appearing in val(X_m) and of the compression of blocks that are longer
than M. Let the consecutive compressed letters be a^{(1)}, a^{(2)}, ..., a^{(ℓ)}, with multiplicities p_1, p_2, ..., p_ℓ.
Lastly, the p_ℓ repetitions of a^{(ℓ)} are replaced by a^{(ℓ+1)}. Observe that a^{(i+1)} does not need to be a^{(i)}_{p_i}
(the letter that replaced the block of p_i letters a^{(i)}), as there might have been some other compression in between.

Recall the definition of weight: for a letter it is the length of the represented string in the original
instance. Consider the weight of the string between X_j and X_k. Clearly, right after the i-th blocks
compression it is exactly p_i w(a^{(i)}). We claim that w(a^{(i+1)}) ≥ p_i w(a^{(i)}): right after the i-th blocks
compression the string between X_j and X_k is simply a^{(i)}_{p_i}. After some operations, this string consists of
some p_{i+1} letters a^{(i+1)}. Since the operations in the whole FCPM do not remove the symbols from the string
between two nonterminals in a rule, nor do they decompress the letters there, it holds that

\[ w(a^{(i+1)}) \ge w(a^{(i)}_{p_i}) = p_i\, w(a^{(i)}). \]

Thus w(a^{(ℓ+1)}) ≥ Π_{i=1}^{ℓ} p_i. Still, by our assumption we consider only the cost of letters that appear in
the pattern. Hence, a^{(ℓ+1)} (or some heavier letter) appears in the pattern, and so M ≥ w(val(X_m)) ≥
w(a^{(ℓ+1)}). Hence,

\[ \log M \ge \sum_{i=1}^{\ell} \log p_i. \]

Therefore, the whole charge of Σ_{i=1}^{ℓ} log p_i to the single rule is in fact at most log M. □
Summing over the rules gives the total cost of O((n + m) log M), as claimed. □
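
As an illustration of the charging argument in Claim 7, consider a small worked instance. The multiplicities below are hypothetical numbers chosen only for illustration; a^{(i)} denotes the i-th compressed letter and w its weight, with w(a^{(1)}) ≥ 1.

\[
p_1 = 4,\quad p_2 = 8,\quad p_3 = 2
\;\Longrightarrow\;
w(a^{(2)}) \ge 4,\quad w(a^{(3)}) \ge 8 \cdot 4 = 32,\quad w(a^{(4)}) \ge 2 \cdot 32 = 64,
\]
\[
\log_2 M \;\ge\; \log_2 w(a^{(4)}) \;\ge\; \log_2 64 \;=\; 6 \;=\; \log_2 4 + \log_2 8 + \log_2 2 .
\]

So the three power creations together cost at most log M for this rule, even though each of them alone could be as large as log M.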

Lemma 7. In O(|G|) time we can fix both the beginning and the end without spoiling either of them. This
introduces O(1) letters to each rule.

Proof. The applied actions are essentially different when first[X_m] = a ≠ b = last[X_m] and when
first[X_m] = last[X_m], with the former being much simpler. We describe them separately.

First and last letter are different 

We assume that first[X_m] = a ≠ b = last[X_m].

Fixing the beginning

Let a' be the second letter of val(X_m). Suppose first that a' ≠ a. As already described, we perform the
compression of the pair aa'. Regardless of whether it is crossing or not, using an approach similar to the
one in Lemma [3] (for a non-crossing pair) or Lemma [5] (for a crossing pair) gives an O(|G|) algorithm for the
compression of this pair. Note that virtually no changes are needed in these two procedures: we simply
compress exactly one pair and not several pairs simultaneously. This fixes the beginning
in this case. Observe that, as a ≠ last[X_m], the end of the pattern is not spoiled in the process.

So suppose that a' = a and let a^ℓ be the a-prefix of val(X_m). Then we perform the compression of blocks
for the letter a. The algorithm described in Lemma [5] can be easily adapted so that it considers only
the letter a and does not make any changes with regard to other letters. In this way the running time still
is O(|G|). However, there is one change, already described in the main part of the paper:

• each a-block of length m ≤ ℓ is replaced by a fresh letter a_m,

• each a-block of length m > ℓ is replaced by a pair of letters a_m a_ℓ.

We should show that in this way no pattern appearance is lost and no new pattern appearance
is introduced. So let val(X_{n+m}) = w_1 a^m w_2 w_3 and val(X_m) = a^ℓ w_2, for m > ℓ. Let w_i be replaced by
w_i'. Then the new text is w_1' a_m a_ℓ w_2' w_3' and the new pattern is a_ℓ w_2', thus there is a pattern appearance
in the new text. The case in which m = ℓ is shown in the same way.

Conversely, let w_1' a_ℓ w_2' w_3' be the new text and a_ℓ w_2' the new pattern. Then the pattern was obtained
from a^ℓ w_2 for some w_2. Furthermore, w_1' a_ℓ was obtained from some w_1 a^m for m ≥ ℓ (this is the only way
to obtain a_ℓ); also, the only way to obtain w_2' is from the same w_2. Hence, no new pattern appearance
was introduced. Therefore this realises the block compression.
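
The modified block compression can be tried out on explicit strings with the following Python sketch. It assumes, as a reading of the rule above, that blocks of length at most ℓ become the single letter a_m and longer blocks become the pair a_m a_ℓ; the function names and the encoding of a_m as the tuple (a, m) are hypothetical, and the sketch works on uncompressed strings rather than on SLPs.

```python
from itertools import groupby

def compress_a_blocks(s, a, l):
    """Replace each maximal block a^m in s: m <= l gives [a_m],
    m > l gives [a_m, a_l]. The letter a_m is encoded as the tuple (a, m)."""
    out = []
    for sym, run in groupby(s):
        m = len(list(run))
        if sym != a:
            out.extend([sym] * m)
        elif m <= l:
            out.append((a, m))
        else:
            out.extend([(a, m), (a, l)])
    return out

def occurrences(text, pat):
    """Naive list of starting positions of pat in text (both sequences)."""
    n, k = len(text), len(pat)
    return [i for i in range(n - k + 1) if list(text[i:i + k]) == list(pat)]
```

For a pattern whose a-prefix is a^ℓ and whose last letter differs from a, the number of appearances before and after the replacement should agree, which is what the argument above establishes.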

To implement this approach we need to know, for each maximal block a^m, whether m > ℓ or not.
This is easy, as the block compression explicitly sorts the lengths of maximal blocks, so the ones longer
than ℓ can be identified (see the lemma on block compression). Clearly the modified procedure properly
implements the (modified) block compression.

As in the previous case, this fixes the pattern beginning and, as last[X_m] ≠ a, it does not spoil the
pattern end.

Note that Lemma [2] works with this modified version of the block compression: while we can introduce
two letters to represent a block, this happens only for blocks of length at least 3. So we can think that the
middle letter is the one that is not compressed.

Fixing the end 

We want to apply exactly the same procedure at the end of val(X_m). However, there can be some
disturbance, as fixing the beginning might have influenced the end:

• the last letter could have been already compressed, which can happen only when b = a'. In this case
we got lucky and we make no additional compression, as the end of the pattern has already been
fixed.

• the second last letter (say b') of val(X_m) equals a' and it was compressed into a letter c (either
due to pair compression or block compression). In this case we make the compression of the pair
ca', even though c is a fresh letter. Note that, as c is the first letter of this pair, this will not spoil
the beginning of the pattern.

The remaining cases, as well as the analysis of the above exceptions, are the same as in the case of fixing the
beginning.

Last and first letter of pattern are the same 

It remains to discuss what actions are applied when first[X_m] = a = last[X_m]. We assume that the
pattern is not of the form a^ℓ; this special case is dealt with separately in Theorem [1].

Let a^ℓ and a^r be the a-prefix and a-suffix of val(X_m) (where, without loss of generality, ℓ ≥ r ≥ 1).
We first make the block compression of a, in which

• m-blocks for m < r are replaced by a_m;

• r-blocks are replaced by a_R;

• m-blocks for r < m < ℓ are replaced by a_R a_m;

• ℓ-blocks are replaced by a_R a_L;

• m-blocks for m > ℓ are replaced by a_R a_m a_L.

The letters a_R and a_L are different, even if r = ℓ. An analysis similar to the one for the case first[X_m] ≠
last[X_m] with block compression shows that the running time is O(|G|). Furthermore, both the
beginning and the end are fixed.

However, in this way we have lengthened the strings in the instance: in the worst case, when ℓ = r = 1,
we replace a single a with a_R a_L. Our goal is to compress each a_L with the following letter and each a_R
with the preceding letter.

As the first operation, we want to make a compression of pairs (some of them perhaps crossing) of
the form {a_L b | a_L b appears in val(X_{n+m}) or val(X_m)}. Note that this list of pairs is already of the form
required from P" in the construction in the lemma on crossing pair compression, with Σ_ℓ = {a_L} and
Σ_r = Σ \ {a_L}. The list of such pairs can be created in O(|G|) time and within the same time bound we can
perform the compression of these pairs, see Lemma [5]. Notice that in this way we do not spoil the end, as a_L ≠ a_R.

Similarly, we then perform the compression of the pairs {b a_R | b a_R appears in val(X_{n+m}) or val(X_m)},
within the same time bound. This does not spoil the beginning.

We show that the process of fixing the beginning and end in the case of first[X_m] = a = last[X_m]
can be seen in an alternative way: we perform the usual pair and block compression, but whenever a
pair (block) is compressed, we replace a neighbouring letter b with b' and forbid it to take part in
compression. Such restrictions are in fact allowed in the proof of Lemma [5].

We inspect how the maximal blocks of a are replaced. Consider some maximal a block appearance
a^m and the letter b (c) preceding (succeeding, respectively) it. Note that it may happen that b = c. Without
loss of generality, assume that ℓ ≥ r.

• if m < r then b a^m c is replaced by b a_m c and b, c can be further compressed. This is equivalent to
standard compression.

• if m = r then b a^r c is replaced by b a_R c, then a_R is compressed with b, giving b'c, and c can be further
compressed. For r = 1 this is equivalent to pair compression, for r > 1 this is in fact stronger.

• if r < m < ℓ then b a^m c is replaced by b a_R a_m c, then a_R is compressed with b, and so we obtain
b' a_m c, and we can further compress c. Note that for m > 2 this is equivalent to performing the
pair compression for ba and then making the block compression for the remaining m − 1 letters. For
m = 2 this is equivalent to the compression of the pair ba and replacing the following a with an
uncompressible a'.

• if m = ℓ then b a^ℓ c is replaced by b a_R a_L c, then a_R is compressed with b and a_L is compressed with
c, and so we obtain b'c', and we cannot compress b', c'. Note that this is equivalent to replacing
a^ℓ c with c', b with b' and declaring that b' cannot be compressed.

• if m > ℓ then b a^m c is replaced by b a_R a_m a_L c, then a_R is compressed with b and a_L is compressed
with c, and so we obtain b' a_m c', and we cannot compress b', c'. Note that this is equivalent to
replacing a^m with a_m, then replacing b and c with uncompressible b' and c'.
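
The five cases of the block replacement can be summarised by a small Python helper. It is a sketch under assumptions: the name replace_a_block is hypothetical, a_m is encoded as the tuple ("a", m), and a_R, a_L as the strings "a_R", "a_L". The case m = ℓ is tested before m = r so that for r = ℓ the block becomes a_R a_L, in line with the remark that a single a is replaced by a_R a_L when ℓ = r = 1.

```python
def replace_a_block(m, r, l):
    """Letters replacing a maximal block a^m, where a^l is the a-prefix and
    a^r the a-suffix of the pattern (l >= r >= 1)."""
    if not (l >= r >= 1 and m >= 1):
        raise ValueError("expected l >= r >= 1 and m >= 1")
    if m < r:
        return [("a", m)]
    if m == l:
        return ["a_R", "a_L"]
    if m == r:
        return ["a_R"]
    if m < l:
        return ["a_R", ("a", m)]
    return ["a_R", ("a", m), "a_L"]
```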

Concerning the weights of the letters in the following analysis (note that replacing a by a_R a_L causes
some problems for the weights), we treat fixing the letters as making a compression and forbidding some pairs
from compression, as explained above. In this way, there is no need to divide the weights. □

Lemma. During FCPM, |G| = O((n + m) log(n + m)) and |Σ| = O((n + m) log(n + m) log M).

Proof. We begin with showing the bound on |G|. Consider a rule of G. On one hand, its
size drops, as we compress letters in it. On the other, some new letters are introduced to the rule, by
popping them from nonterminals. We estimate both influences.

Observe that Claim [2] applies to the bodies of the rules and so an argument similar to the one in the
proof of Lemma [2] can be used to show that the size of the explicit strings in the rules decreases by a constant
factor in each phase. Of course, the newly introduced letters may be unaffected by this compression.
By routine calculations, as the rule's length decreases by a constant factor in each phase, if O(log(n + m)) letters
are introduced to the rule per phase, the rule's size is also O(log(n + m)) (with a larger constant, though).
Hence it is left to show that O(log(n + m)) letters are introduced to a rule in one phase.

We now estimate how many new letters are introduced into a fixed rule of G in one phase. We
do not count the letters that merely replaced some other letters (as in the compression of a maximal block or
a pair compression), but only the letters that were popped into the rules.

First, the subprocedure for fixing the beginning and end is used; by Lemma [7] it introduces O(1)
new letters to a rule. In non-crossing pair compression no new letters are introduced. Concerning the
crossing pair compression, by the lemma on crossing pair compression at most 2 log(n + m) letters are
introduced to each rule. Lastly, when RemCrBlocks is applied, it introduces at most 4 new symbols
into a rule. These can be long blocks of some letter, but they are immediately afterwards replaced by one
letter each; thus, at most 4 new letters are introduced in this way. In total, this gives O(log(n + m)) letters
per rule.
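
The routine calculation behind the bound on the rule size can be spelled out as follows. The notation is an assumption introduced only for this illustration: L_t is the length of a rule after phase t, c < 1 is the constant shrink factor and k log(n + m) bounds the number of letters popped into the rule in one phase.

\[
L_{t+1} \le c\,L_t + k\log(n+m)
\;\Longrightarrow\;
L_t \le c^{t} L_0 + k\log(n+m)\sum_{j=0}^{t-1} c^{j}
\le L_0 + \frac{k}{1-c}\,\log(n+m).
\]

Since L_0 is constant for an SLP rule, every rule has length O(log(n + m)) in every phase, which summed over the n + m rules gives |G| = O((n + m) log(n + m)).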

Concerning the alphabet, the time used in one phase is O((n + m) log(n + m) + |G|), which is
O((n + m) log(n + m)). Thus no more than this number of letters is introduced in one phase. There are
O(log M) phases (see the lemma bounding the number of phases), and so the bound
O((n + m) log M log(n + m)) on |Σ| follows. □

Theorem 1. Algorithm FCPM returns an O((n + m) log(n + m)) representation of all pattern appearances,
where n (m) is the size of the SLP-compressed text (pattern, respectively) and M is the size of the
decompressed pattern. It runs in O((n + m) log M log(n + m)) time.

This representation allows calculation of the number of pattern appearances and, if N fits in O(1)
codewords, also of the position of the first, last etc. pattern appearance.

Proof of Theorem 1. The running time was already discussed in the main part of the paper. Observe that
since all subprocedures properly implement pair compression and block compression, no appearances of
the pattern are lost or 'gained'. One special case needs some explanation, though: when the pattern
is of the form a^ℓ at the beginning of the phase (which can be checked during the attempt of fixing the
beginning), we simply perform the block compression and mark the letters a_m for m ≥ ℓ. In such a
case this letter represents m − ℓ + 1 appearances of the pattern.
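
The count m − ℓ + 1 for the special case of a pattern a^ℓ can be checked on an explicit string with the sketch below. The function name is hypothetical and the code scans an uncompressed text, which the algorithm itself of course avoids.

```python
from itertools import groupby

def count_power_pattern(text, a, l):
    """Number of (possibly overlapping) appearances of a^l in text:
    a maximal block a^m with m >= l contributes m - l + 1 of them."""
    total = 0
    for sym, run in groupby(text):
        m = len(list(run))
        if sym == a and m >= l:
            total += m - l + 1
    return total
```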

As noticed, the positions of the pattern appearances are in general hard to calculate, as they may depend on N.
However, everything is much simpler when we can assume that N fits in a constant number of code words.
This is formalised in the second claim of the theorem, which regards the operations on the appearances
when N fits in O(1) code words. Recall the notion of the weight of a letter: for each input letter a we define
w(a) = 1. Moreover, if b represents a compressed string w, then w(b) is defined as w(w). Then all weights
of letters are at most N and so can also be stored in O(1) code words, and we calculate them during
the compression, with only constant overhead. Then the positions of appearances, the number of appearances
etc. can be calculated using simple bottom-up dynamic programming on the SLPs representing the final
text and pattern. □
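
The weight bookkeeping can be sketched as follows. The encoding is an assumption made for illustration: input letters are numbered 0, 1, ..., and every compression step is logged as a pair (new letter, list of letters it replaced), in the order in which the letters were created.

```python
def letter_weights(num_input_letters, definitions):
    """Weights of all letters: 1 for input letters, and for a compressed
    letter the sum of the weights of the letters it replaced.

    definitions: list of (new_letter, replaced_letters) in creation order,
    so every replaced letter already has a weight when it is needed."""
    w = {a: 1 for a in range(num_input_letters)}
    for new_letter, replaced in definitions:
        w[new_letter] = sum(w[x] for x in replaced)
    return w
```

For example, a pair compression of letters 0 and 1 into 2, followed by a block compression of three letters 2 into 3, gives w(2) = 2 and w(3) = 6; each step costs time proportional to the length of the logged replacement when done this way.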